trying again to get incremental backup
A few years ago, I sketched out a design for incremental backup, but
no patch for incremental backup ever got committed. Instead, the whole
thing evolved into a project to add backup manifests, which are nice,
but not as nice as incremental backup would be. So I've decided to
have another go at incremental backup itself. Attached are some WIP
patches. Let me summarize the design and some open questions and
problems with it that I've discovered. I welcome problem reports and
test results from others, as well.
The basic design of this patch set is pretty simple, and there are
three main parts. First, there's a new background process called the
walsummarizer which runs all the time. It reads the WAL and generates
WAL summary files. WAL summary files are extremely small compared to
the original WAL and contain only the minimal amount of information
that we need in order to determine which parts of the database need to
be backed up. They tell us about files getting created, destroyed, or
truncated, and they tell us about modified blocks. Naturally, we don't
find out about blocks that were modified without any write-ahead log
record, e.g. hint bit updates, but those are of necessity not critical
for correctness, so it's OK. Second, pg_basebackup has a mode where it
can take an incremental backup. You must supply a backup manifest from
a previous backup - either a full backup or, as discussed below, an
earlier incremental. We read the WAL summary files that have been
generated between the start of the previous backup and the start of
this one, and use that to figure out which relation files have changed
and how much. Non-relation files are sent normally, just as they would
be in a full backup. Relation files can either be sent in full or be
replaced by an incremental file, which contains a subset of the blocks
in the file plus a bit of information to handle truncations properly.
Third, there's now a pg_combinebackup utility which takes a full
backup and one or more incremental backups, performs a bunch of sanity
checks, and if everything works out, writes out a new, synthetic full
backup, aka a data directory.
Simple usage example:
pg_basebackup -cfast -Dx
pg_basebackup -cfast -Dy --incremental x/backup_manifest
pg_combinebackup x y -o z
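To make the "incremental file" format concrete, here is a rough sketch
of the layout that patch 7's sendFile() emits. The struct is purely
illustrative - the patch writes the fields out individually via
push_to_sink() - and the exact layout is a prototype detail, not a
stable format:

typedef struct
{
    unsigned    magic;                  /* INCREMENTAL_MAGIC */
    unsigned    num_incremental_blocks; /* how many block images follow */
    unsigned    truncation_block_length;    /* for handling truncations */
    BlockNumber blocks[FLEXIBLE_ARRAY_MEMBER];  /* block numbers, relative
                                                 * to the segment start */
} IncrementalFileHeader;    /* hypothetical name, for illustration */
/* ...followed by num_incremental_blocks block images of BLCKSZ bytes each. */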
The part of all this with which I'm least happy is the WAL
summarization engine. Actually, the core process of summarizing the
WAL seems totally fine, and the file format is very compact thanks to
some nice ideas from my colleague Dilip Kumar. Someone may of course
wish to argue that the information should be represented in some other
file format instead, and that can be done if it's really needed, but I
don't see a lot of value in tinkering with it, either. Where I do
think there's a problem is deciding how much WAL ought to be
summarized in one WAL summary file. Summary files cover a certain
range of WAL records - they have names like
$TLI${START_LSN}${END_LSN}.summary (so, assuming the TLI is printed as
eight hex digits and each LSN as sixteen, something like
0000000100000000010000280000000001FFD8B8.summary). It's not too hard
to figure out
where a file should start - generally, it's wherever the previous file
ended, possibly on a new timeline, but figuring out where the summary
should end is trickier. You always have the option to either read
another WAL record and fold it into the current summary, or end the
current summary where you are, write out the file, and begin a new
one. So how do you decide what to do?
I originally had the idea of summarizing a certain number of MB of WAL
per WAL summary file, and so I added a GUC wal_summarize_mb for that
purpose. But then I realized that actually, you really want WAL
summary file boundaries to line up with possible redo points, because
when you do an incremental backup, you need a summary that stretches
from the redo point of the checkpoint written at the start of the
prior backup to the redo point of the checkpoint written at the start
of the current backup. The block modifications that happen in that
range of WAL records are the ones that need to be included in the
incremental. Unfortunately, there's no indication in the WAL itself
that you've reached a redo point, but I wrote code that tries to
notice when we've reached the redo point stored in shared memory and
stops the summary there. But I eventually realized that's not good
enough either, because if summarization zooms past the redo point
before noticing the updated redo point in shared memory, the backup
would sit around waiting for the next summary file to be generated so
that it had enough summaries to proceed, while the summarizer, in no
hurry to finish up the current file, just sat there waiting for more
WAL to be generated. Eventually the incremental backup would simply
time out. I tried to fix that by making it so that
if somebody's waiting for a summary file to be generated, they can let
the summarizer know about that and it can write a summary file ending
at the LSN up to which it has read and then begin a new file from
there. That seems to fix the hangs, but now I've got three
overlapping, interconnected systems for deciding where to end the
current summary file, and maybe that's OK, but I have a feeling there
might be a better way.
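For clarity, here is roughly what those three interlocking cutoff
rules amount to; this is pseudocode, with function and variable names
invented for illustration rather than taken from the patch:

/* After reading each WAL record, decide whether to end the summary. */
if (bytes_summarized >= wal_summarize_mb * 1024 * 1024)
    end_current_summary();  /* size-based cutoff from the GUC */
else if (read_lsn >= redo_point_from_shared_memory())
    end_current_summary();  /* try to line up with a redo point */
else if (backup_is_waiting_for_a_summary())
    end_current_summary();  /* cut off at read_lsn so the backup can proceed */
else
    keep_reading_wal();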
Dilip had an interesting potential solution to this problem, which was
to always emit a special WAL record at the redo pointer. That is, when
we fix the redo pointer for the checkpoint record we're about to
write, also insert a WAL record there. That way, when the summarizer
reaches that sentinel record, it knows it should stop the summary just
before. I'm not sure whether this approach is viable, especially from
a performance and concurrency perspective, and I'm not sure whether
people here would like it, but it does seem like it would make things
a whole lot simpler for this patch set.
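To spell that idea out, a hypothetical sketch - this is not code from
the patch set, and the record type name is invented for illustration:

/* In CreateCheckPoint(), at the point where the redo location is
 * fixed, also insert a sentinel record and aim the redo pointer at
 * it, so that replay starts with this record: */
int         dummy = 0;      /* placeholder payload */

XLogBeginInsert();
XLogRegisterData((char *) &dummy, sizeof(dummy));
(void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
checkPoint.redo = ProcLastRecPtr;   /* start of the record just inserted */

When the summarizer then reads an XLOG_CHECKPOINT_REDO record, it can
end the current summary file immediately before that record's LSN,
without consulting shared memory at all.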
Another thing that I'm not too sure about is: what happens if we find
a relation file on disk that doesn't appear in the backup_manifest for
the previous backup and isn't mentioned in the WAL summaries either?
The fact that said file isn't mentioned in the WAL summaries seems
like it ought to mean that the file is unchanged; but an unchanged
file should also have been present in the previous backup, so perhaps
this ought to be an error condition. But I'm not too sure
about that treatment. I have a feeling that there might be some subtle
problems here, especially if databases or tablespaces get dropped and
then new ones get created that happen to have the same OIDs. And what
about wal_level=minimal? I'm not at a point where I can say I've gone
through and plugged up these kinds of corner-case holes tightly yet,
and I'm worried that there may be still other scenarios of which I
haven't even thought. Happy to hear your ideas about what the problem
cases are or how any of the problems should be solved.
A related design question is whether we should really be sending the
whole backup manifest to the server at all. If it turns out that we
don't really need anything except for the LSN of the previous backup,
we could send that one piece of information instead of everything. On
the other hand, if we need the list of files from the previous backup,
then sending the whole manifest makes sense.
Another big and rather obvious problem with the patch set is that it
doesn't currently have any automated test cases, or any real
documentation. Those are obviously things that need a lot of work
before there could be any thought of committing this. And probably a
lot of bugs will be found along the way, too.
A few less-serious problems with the patch:
- We don't have an incremental JSON parser, so if you have a
backup_manifest > 1GB, pg_basebackup --incremental is going to fail.
That's also true of the existing code in pg_verifybackup, and for the
same reason. I talked to Andrew Dunstan at one point about adapting
our JSON parser to support incremental parsing, and he had a patch for
that, but I think he found some problems with it and I'm not sure what
the current status is.
- The patch does support differential backup, aka an incremental atop
another incremental. There's no particular limit to how long a chain
of backups can be. However, pg_combinebackup currently requires that
the first backup is a full backup and all the later ones are
incremental backups. So if you have a full backup a and an incremental
backup b and a differential backup c, you can combine a, b, and c to get
a full backup equivalent to one you would have gotten if you had taken
a full backup at the time you took c. However, you can't combine b and
c with each other without also combining them with a, and that might
be desirable in some situations (see the example just after this
list). You might want to collapse a bunch of
older differential backups into a single one that covers the whole
time range of all of them. I think that the file format can support
that, but the tool is currently too dumb.
- We only know how to operate on directories, not tar files. I thought
about that when working on pg_verifybackup as well, but I didn't do
anything about it. It would be nice to go back and make that tool work
on tar-format backups, and this one, too. I don't think there would be
a whole lot of point trying to operate on compressed tar files because
you need random access and that seems hard on a compressed file, but
on uncompressed files it seems at least theoretically doable. I'm not
sure whether anyone would care that much about this, though, even
though it does sound pretty cool.
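Regarding the differential-backup limitation above, to connect it to
the earlier usage example: given a chain like the one described there,
the first command below works, while the second is currently rejected
because b is not a full backup:

pg_combinebackup a b c -o restored
pg_combinebackup b c -o collapsed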
In the attached patch series, patches 1 through 6 are various
refactoring patches, patch 7 is the main event, and patch 8 adds a
useful inspection tool.
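(For anyone who wants to poke at the WAL summaries while testing: per
its --help output, the patch 8 tool takes summary files as arguments,
e.g. pg_walsummary -i pg_wal/summaries/<summary file>, where -i lists
block numbers individually rather than as ranges.)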
Thanks,
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v1-0006-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
From f66a97fcca07bb56e6a8e644b6bb4d6bee24ab94 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:28 -0400
Subject: [PATCH v1 6/8] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..e4cd26762b 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -65,6 +65,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 9efc80ac02..cc6671edca 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -17,6 +17,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 2379f7be7b..672e8bcf25 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v1-0008-Add-new-pg_walsummary-tool.patch
From 69f8a513cc6b8fcddbc4be038fd09e9422867e9d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:39 -0400
Subject: [PATCH v1 8/8] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs documentation and tests.
---
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
6 files changed, 347 insertions(+)
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt = {0};	/* initialize option flags to false */
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
--
2.37.1 (Apple Git-137.1)
v1-0007-Prototype-patch-for-incremental-and-differential-.patch
From 8485fc23d54cc1e359a71801845ea255584905d5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v1 7/8] Prototype patch for incremental and differential
backup.
We don't differentiate between incremental and differential backups;
the term "incremental" as used herein means "either incremental or
differential".
This adds a new background process, the WAL summarizer, whose behavior
is governed by new GUCs wal_summarize_mb and wal_summarize_keep_time.
It writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block" which is
0 if a relation is created or destroyed within a certain range of WAL
records, or otherwise the shortest length to which the relation was
truncated during that range of WAL records, or otherwise
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied in case of an incremental backup covering that
range of WAL records.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice if we could do something about incremental
JSON parsing.
XXX. This needs a lot of work on documentation and tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions.
---
doc/src/sgml/monitoring.sgml | 17 +
src/backend/access/transam/xlog.c | 97 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 10 +-
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 340 +++-
src/backend/backup/basebackup_incremental.c | 867 ++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1414 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
src/backend/utils/activity/wait_event.c | 15 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 108 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 46 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 29 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1268 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 +++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/include/utils/wait_event.h | 7 +-
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 24 +
66 files changed, 8454 insertions(+), 70 deletions(-)
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5cfdc70c03..97809a73f6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1161,6 +1161,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry><literal>WalSenderMain</literal></entry>
<entry>Waiting in main loop of WAL sender process.</entry>
</row>
+ <row>
+ <entry><literal>WalSummarizeWAL</literal></entry>
+ <entry>Waiting in WAL summarizer process for new WAL to be written.</entry>
+ </row>
<row>
<entry><literal>WalWriterMain</literal></entry>
<entry>Waiting in main loop of WAL writer process.</entry>
@@ -1591,6 +1595,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting for a read from a timeline history file during a walsender
timeline command.</entry>
</row>
+ <row>
+ <entry><literal>WalSummaryRead</literal></entry>
+ <entry>Waiting to read from a WAL summary file.</entry>
+ </row>
+ <row>
+ <entry><literal>WalSummaryWrite</literal></entry>
+ <entry>Waiting to write to a WAL summary file.</entry>
+ </row>
<row>
<entry><literal>WALSync</literal></entry>
<entry>Waiting for a WAL file to reach durable storage.</entry>
@@ -2357,6 +2369,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to acquire an exclusive lock to truncate off any
empty pages at the end of a table vacuumed.</entry>
</row>
+ <row>
+ <entry><literal>WalSummarizerError</literal></entry>
+ <entry>Waiting to retry after recovering from an error in the
+ WAL summarizer process.</entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 664d4ba598..6c66d5118b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3477,6 +3478,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3756,8 +3794,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3800,6 +3838,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5123,9 +5181,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6802,6 +6860,17 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * If there hasn't been much system activity in a while, the WAL
+ * summarizer may be sleeping for relatively long periods, which could
+ * delay an incremental backup that has started concurrently. In the hopes
+ * of avoiding that, poke the WAL summarizer here.
+ *
+ * Possibly this should instead be done at some earlier point in this
+ * function, but it's not clear that it matters much.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7476,6 +7545,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
@@ -8462,8 +8545,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
/*
* Try to parse the directory name as an unsigned integer.
*
- * Tablespace directories should be positive integers that can
- * be represented in 32 bits, with no leading zeroes or trailing
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
* garbage. If we come across a name that doesn't meet those
* criteria, skip it.
*/
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 23461c9d2c..3ad6b679d5 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 4ff4430006..89ddec5bf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
@@ -1340,7 +1346,7 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
- char *endp;
+ char *endp;
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1363,7 +1369,7 @@ read_tablespace_map(List **tablespaces)
ti = palloc0(sizeof(tablespaceinfo));
errno = 0;
ti->oid = strtoul(str, &endp, 10);
- if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
ereport(FATAL,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 64ab54fe06..8aea2a4a76 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -75,22 +78,37 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
+typedef struct
+{
+ const char *filename;
+ pg_checksum_context *checksum_ctx;
+ bbsink *sink;
+ size_t bytes_sent;
+} FileChunkContext;
+
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +120,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +239,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +290,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +313,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +354,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +372,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +634,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +710,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +789,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +990,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1014,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1059,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1134,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1168,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1188,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1148,7 +1196,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isRelationDir = false; /* Does directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
/*
@@ -1182,14 +1231,17 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relfilenumber = InvalidRelFileNumber;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
ForkNumber relForkNum = InvalidForkNumber;
unsigned segno = 0;
bool isRelationFile = false;
@@ -1256,9 +1308,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
char initForkFile[MAXPGPATH];
/*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
*/
snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relfilenumber);
@@ -1332,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1405,27 +1458,79 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ BlockNumber relative_block_numbers[RELSEG_SIZE];
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
- if (sent || sizeonly)
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
@@ -1444,6 +1549,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1451,7 +1562,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1460,6 +1572,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1492,22 +1605,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
+
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1690,6 +1892,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
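
To make push_to_sink's contract concrete, here is a minimal standalone
sketch of the same copy-and-flush loop. It is just an illustration, not
part of the patch: the tiny buffer, the printf standing in for
bbsink_archive_contents, and the demo strings are invented, and the
checksum update is omitted.

#include <stdio.h>
#include <string.h>

#define BUF_LEN 8				/* tiny, so flushes are easy to see */

static char buffer[BUF_LEN];
static size_t used;

/* Stand-in for bbsink_archive_contents: report what would be archived. */
static void
flush_buffer(size_t n)
{
	printf("flush %zu bytes: %.*s\n", n, (int) n, buffer);
}

static void
push(const void *data, size_t length)
{
	while (length > 0)
	{
		size_t		bytes_to_copy;

		/* <, not <=, so an exact fill triggers a flush now */
		if (length < BUF_LEN - used)
		{
			memcpy(buffer + used, data, length);
			used += length;
			return;
		}
		bytes_to_copy = BUF_LEN - used;
		memcpy(buffer + used, data, bytes_to_copy);
		data = (const char *) data + bytes_to_copy;
		length -= bytes_to_copy;
		flush_buffer(BUF_LEN);
		used = 0;
	}
}

int
main(void)
{
	push("hello, ", 7);
	push("wal summaries", 13);
	if (used > 0)
		flush_buffer(used);		/* the caller-performed final flush */
	return 0;
}

Note the final flush in the caller: sendFile has to do the same thing
after pushing the incremental file header.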
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..b70eeb0282
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,867 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+	 * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+	 * this writing, so it seems to make sense for our estimate to be
+	 * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+	 * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
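
To illustrate with made-up OIDs: for database 16384 and relfilenumber
16385, main fork, this turns base/16384/16385 into
base/16384/INCREMENTAL.16385 for segment 0, while segment 3 becomes
base/16384/INCREMENTAL.16385.3.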
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+	 * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
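
To put the 90% threshold in concrete terms: with the default 8kB block
size, a full 1GB segment contains 131072 blocks, so we return
BACK_UP_FILE_INCREMENTALLY only when 117964 or fewer of those blocks
(90% of 131072, rounded down) need to be sent; past that point the
whole segment goes out as-is.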
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+	 * Three four-byte quantities (magic number, block count, truncation block
+	 * length) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
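
As a worked example, again with 8kB blocks: an incremental file
carrying 10 modified blocks occupies 3 * 4 + 10 * 4 + 10 * 8192 =
81972 bytes. In other words, the magic number, block count, truncation
block length, and block-number array add only 52 bytes of overhead on
top of the 81920 bytes of block contents.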
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+	const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..1cace3b2fe 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c',
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
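
As a standalone illustration of the name format - the TLI followed by
the start and end LSNs, each split into two 8-character hex fields -
here is the same decoding applied to an invented file name:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* Hypothetical summary covering TLI 1, 0/1000028 .. 0/2000100. */
	const char *name = "0000000100000000010000280000000002000100.summary";
	unsigned int tmp[5];
	uint64_t	start_lsn;
	uint64_t	end_lsn;

	if (sscanf(name, "%08X%08X%08X%08X%08X",
			   &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]) != 5)
		return 1;
	start_lsn = ((uint64_t) tmp[1]) << 32 | tmp[2];
	end_lsn = ((uint64_t) tmp[3]) << 32 | tmp[4];
	printf("tli %u, start %X/%X, end %X/%X\n", tmp[0],
		   (unsigned) (start_lsn >> 32), (unsigned) start_lsn,
		   (unsigned) (end_lsn >> 32), (unsigned) end_lsn);
	return 0;
}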
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
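
Here is the same sweep as a minimal standalone program, stripped of
the List machinery. The LSNs are invented so that the second summary
leaves a gap; this is just an illustration, not part of the patch:

#include <stdio.h>
#include <stdint.h>

typedef struct
{
	uint64_t	start_lsn;
	uint64_t	end_lsn;
} range;

/* Same algorithm, over an array already sorted by start_lsn. */
static int
ranges_are_complete(range *r, int n, uint64_t start, uint64_t end,
					uint64_t *missing)
{
	uint64_t	current = start;

	for (int i = 0; i < n; ++i)
	{
		if (r[i].start_lsn > current)
			break;				/* found a gap */
		if (r[i].end_lsn > current)
		{
			current = r[i].end_lsn;
			if (current >= end)
				return 1;		/* coverage proved */
		}
	}
	*missing = current;
	return 0;
}

int
main(void)
{
	range		r[] = {{0x1000, 0x2000}, {0x3000, 0x4000}};
	uint64_t	missing;

	if (!ranges_are_complete(r, 2, 0x1000, 0x4000, &missing))
		printf("first unsummarized LSN: %llx\n",
			   (unsigned long long) missing);
	return 0;
}

This prints 2000, i.e. the end of the first summary, which is exactly
what the errdetail above would report.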
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
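
For interactive use, the intent is that you can do something like this
(the LSN arguments here are invented, and the output column names
depend on the pg_proc entries defined elsewhere in the patch set):

SELECT * FROM pg_available_wal_summaries();
SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/2000100');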
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 4c49393fc5..c85ac19f4a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -114,6 +114,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -251,6 +252,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -442,6 +444,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1845,6 +1849,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2736,6 +2743,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3089,6 +3098,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3207,6 +3217,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3602,6 +3626,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3752,6 +3782,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3778,6 +3810,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3875,6 +3908,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4096,6 +4130,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5402,6 +5438,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5538,6 +5578,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..926b6c6ae4
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1414 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+	 * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+	 * the next summary file will start. Normally, these are the TLI and LSN
+	 * at which the last file ended; in that case, lsn_is_exact is
+ * true. If, however, the LSN is just an approximation, then lsn_is_exact
+ * is false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of
+ * a record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ *
+ * switch_requested can be set to true to notify the summarizer that a
+ * new WAL summary file should be written as soon as possible, without
+ * trying to read more WAL first.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+ bool switch_requested;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+ XLogRecPtr redo_pointer;
+ bool redo_pointer_reached;
+ XLogRecPtr redo_pointer_refresh_lsn;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * one minute (300 * 200 = 60 * 1000).
+ */
+#define MAX_SLEEP_QUANTA 300
+#define MS_PER_SLEEP_QUANTUM 200
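+
+/*
+ * So, for example, consecutive sleeps after empty polls last 200ms, 400ms,
+ * 800ms, and so on; after nine doublings the one-minute cap is reached.
+ */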
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->switch_requested = false;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered
+ * by a single summary file. If we read a WAL record that ends after
+ * the cutoff LSN computed here, we'll stop the summary. In most cases,
+ * it will actually stop earlier than that, but this is here as a
+ * backstop.
+ */
+ cutoff_lsn = current_lsn + wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && end_of_summary_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->switch_requested = false;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is
+ * returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if it's
+ * just the beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
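+ /* (E.g., with 16MB segments, oldest segment 5 maps to LSN 0/5000000.) */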
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the information to the caller, as requested. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ * If it hasn't, but the in-memory value has reached the target value,
+ * request that a file be written as soon as possible.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (summarized_lsn < lsn &&
+ WalSummarizerCtl->pending_lsn >= lsn)
+ WalSummarizerCtl->switch_requested = true;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /*
+ * Limit the sleep to 1 second, because we may need to request a
+ * switch.
+ */
+ if (remaining_timeout > 1000)
+ remaining_timeout = 1000;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
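
As an aside, here is roughly how I expect a caller (e.g. the incremental
backup code) to use WaitForWalSummarization(); this is just an illustrative
sketch, not code from the patch, and backup_start_lsn is a made-up name:

/* Hypothetical caller, for illustration only: wait up to 60 seconds. */
XLogRecPtr	summarized;

summarized = WaitForWalSummarization(backup_start_lsn, 60000);
if (summarized < backup_start_lsn)
	ereport(ERROR,
			(errmsg("timed out waiting for WAL summarization"),
			 errdetail("Summarization is complete only through %X/%X, but %X/%X is needed.",
					   LSN_FORMAT_ARGS(summarized),
					   LSN_FORMAT_ARGS(backup_start_lsn))));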
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = start_lsn;
+ private_data->redo_pointer_reached =
+ (start_lsn >= private_data->redo_pointer);
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool switch_requested;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /*
+ * We attempt, on a best effort basis only, to make WAL summary file
+ * boundaries line up with checkpoint cycles. So, if the last redo
+ * pointer we've seen was in the future, and this record starts at
+ * that redo pointer, stop before processing and let it be included in
+ * the next summary file.
+ *
+ * Note that in the case of a checkpoint triggered by a backup, the
+ * redo pointer is likely to be pointing to the first record on a
+ * page. Before reading the record, xlogreader->EndRecPtr will have
+ * pointed to the start of the page, which precedes the redo LSN. But
+ * after reading the next record, we'll advance over the page header
+ * and realize that the next record starts at the redo LSN exactly,
+ * making this the first point at which we can realize that it's time
+ * to stop.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->ReadRecPtr >= private_data->redo_pointer)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ default:
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /*
+ * Also update shared memory, and handle any request for a
+ * WAL summary file switch.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ switch_requested = WalSummarizerCtl->switch_requested;
+ LWLockRelease(WALSummarizerLock);
+ if (switch_requested)
+ break;
+
+ /*
+ * Periodically update our notion of the redo pointer, because it
+ * might be changing concurrently. There's no interlocking here: we
+ * might race past the new redo pointer before we learn about it.
+ * That's OK; we only use the redo pointer as a heuristic for where to
+ * stop summarizing.
+ *
+ * It would be nice if we could just fetch the updated redo pointer on
+ * every pass through this loop, but that seems a bit too expensive:
+ * GetRedoRecPtr acquires a heavily-contended spinlock. So, instead,
+ * just fetch the updated value if we've just had to sleep, or if
+ * we've read more than a segment's worth of WAL without sleeping.
+ */
+ if (private_data->waited || xlogreader->EndRecPtr >
+ private_data->redo_pointer_refresh_lsn + wal_segment_size)
+ {
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = xlogreader->EndRecPtr;
+ private_data->redo_pointer_reached =
+ (xlogreader->EndRecPtr >= private_data->redo_pointer);
+ }
+
+ /*
+ * Recheck whether we've just caught up with the redo pointer, and
+ * if so, stop. This has the same purpose as the earlier check for
+ * the same condition above, but there we've just read a record and
+ * might decide against including it in the current summary file,
+ * whereas here we've already included it and might decide against
+ * reading the next one. Note that we may have just refreshed our
+ * notion of the redo pointer, so it's smart to check here before we
+ * do any more work.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->EndRecPtr >= private_data->redo_pointer)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file; the xlogreader was already freed above. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
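
Incidentally, since the summary file name just packs the TLI and both LSNs
into five 8-hex-digit fields, taking one apart again is a single sscanf.
A sketch, assuming the usual backend headers (the function name here is
invented; the patch's walsummary.c presumably has its own equivalent):

/* Illustrative only: decode TTTTTTTTSSSSSSSSSSSSSSSSEEEEEEEEEEEEEEEE.summary */
static bool
parse_wal_summary_filename(const char *name, TimeLineID *tli,
						   XLogRecPtr *start_lsn, XLogRecPtr *end_lsn)
{
	uint32		tmp_tli,
				start_hi,
				start_lo,
				end_hi,
				end_lo;

	/* Note: this does not verify the ".summary" suffix; a sketch only. */
	if (sscanf(name, "%08X%08X%08X%08X%08X",
			   &tmp_tli, &start_hi, &start_lo, &end_hi, &end_lo) != 5)
		return false;

	*tli = (TimeLineID) tmp_tli;
	*start_lsn = ((uint64) start_hi << 32) | start_lo;
	*end_lsn = ((uint64) end_hi << 32) | end_lo;
	return true;
}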
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
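+ *
+ * For example, truncating a relation to 100 blocks sets the limit block
+ * for the affected forks to 100, after which previously-recorded
+ * modifications to blocks 100 and above cease to matter.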
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * at least one full page is available; read just that page, and have
+ * the caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by one quantum for each page read, which is a fairly arbitrary
+ * way of trying to be reactive without overreacting; reading just a
+ * single page leaves the sleep time unchanged.
+ */
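+
+ /*
+ * For example, if sleep_quanta is currently 10 and we have read 4 pages
+ * since the last sleep, the next sleep lasts (10 - 4) * 200ms = 1.2s.
+ */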
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove old WAL summary files, if appropriate. To avoid doing this work
+ * too often, we only attempt it once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the summary
+ * file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index cb467ca46f..fa2bf4ee0a 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d3a136b6f5..39eb293e5f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these, as we do elsewhere during COPY. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
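
For reference, the client-side conversation is plain COPY-in; in outline
(a sketch only, error handling mostly omitted; conn is assumed to be an
open replication connection and fd an open manifest file):

char		buf[8192];
ssize_t		nbytes;
PGresult   *res;

if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
	pg_fatal("could not send UPLOAD_MANIFEST: %s", PQerrorMessage(conn));
res = PQgetResult(conn);				/* expect PGRES_COPY_IN */
while ((nbytes = read(fd, buf, sizeof buf)) > 0)
	PQputCopyData(conn, buf, nbytes);	/* CopyData ('d') messages */
PQputCopyEnd(conn, NULL);				/* CopyDone ('c') */
res = PQgetResult(conn);				/* expect PGRES_COMMAND_OK */

The real client code is in the pg_basebackup changes later in this patch.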
/*
* Handle START_REPLICATION command.
*
@@ -1802,7 +1954,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1864,6 +2016,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..17608b3b8e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -135,6 +136,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, SnapMgrShmemSize());
@@ -283,6 +285,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 6c7cf6c295..49f76e82fb 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock 44
# 45 was XactTruncationLock until removal of BackendRandomLock
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
+WALSummarizerLock 48
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d422..bd0a921a3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -292,7 +292,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -314,6 +315,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 7940d64639..36b88f55b1 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -245,6 +245,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_WAL_SENDER_MAIN:
event_name = "WalSenderMain";
break;
+ case WAIT_EVENT_WAL_SUMMARIZER_WAL:
+ event_name = "WalSummarizerWal";
+ break;
case WAIT_EVENT_WAL_WRITER_MAIN:
event_name = "WalWriterMain";
break;
@@ -466,6 +469,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
event_name = "WalReceiverWaitStart";
break;
+ case WAIT_EVENT_WAL_SUMMARY_READY:
+ event_name = "WalSummaryReady";
+ break;
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
break;
@@ -515,6 +521,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
case WAIT_EVENT_VACUUM_TRUNCATE:
event_name = "VacuumTruncate";
break;
+ case WAIT_EVENT_WAL_SUMMARIZER_ERROR:
+ event_name = "WalSummarizerError";
+ break;
/* no default case, so that compiler will warn */
}
@@ -747,6 +756,12 @@ pgstat_get_wait_io(WaitEventIO w)
case WAIT_EVENT_WAL_READ:
event_name = "WALRead";
break;
+ case WAIT_EVENT_WAL_SUMMARY_READ:
+ event_name = "WalSummaryRead";
+ break;
+ case WAIT_EVENT_WAL_SUMMARY_WRITE:
+ event_name = "WalSummaryWrite";
+ break;
case WAIT_EVENT_WAL_SYNC:
event_name = "WALSync";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index a604432126..eb5736ad85 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 71e27f8eb0..c4918db4f9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -61,6 +61,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -694,6 +695,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3167,6 +3170,32 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e4c0269fa3..d028d02861 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
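
Since both of these GUCs are PGC_SIGHUP, summarization can be reconfigured
without a restart. For example (illustrative only; adjust connection options
and -D to taste), this shuts the summarizer down, per
HandleWalSummarizerInterrupts above:

psql -c "ALTER SYSTEM SET wal_summarize_mb = 0"
pg_ctl reload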
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 09a5c98cc0..220f51a32d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -230,6 +230,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1dc8efe0cb..3ffe15ac74 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with v16.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 160000
+
/*
* Different ways to include WAL
*/
@@ -216,7 +221,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -684,6 +690,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * On newer server versions, likewise create pg_wal/summaries.
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1724,7 +1747,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1790,7 +1815,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* We've sent the whole file, so close it. */
+ if (close(fd) != 0)
+ pg_fatal("could not close file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1897,6 +1989,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2252,6 +2345,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2288,6 +2382,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2312,7 +2407,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2347,6 +2442,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2756,7 +2854,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 793d64863c..22a10477ec 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..cb20480aae
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,46 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
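+ *
+ * For illustration only (all values below are made up), an incremental
+ * backup's backup_label might contain lines like:
+ *
+ * START WAL LOCATION: 0/6000028 (file 000000010000000000000006)
+ * START TIMELINE: 1
+ * INCREMENTAL FROM LSN: 0/4000028
+ * INCREMENTAL FROM TLI: 1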
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and if sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", src);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
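
To make the intended calling convention concrete, here is a minimal sketch
(paths hypothetical, and assuming the usual frontend includes: postgres_fe.h,
common/checksum_helper.h, copy_file.h) of copying one file while accumulating
a CRC-32C checksum; pg_checksum_final() then extracts the result:

pg_checksum_context ctx;
uint8 payload[PG_CHECKSUM_MAX_LENGTH];
int payload_len;

pg_checksum_init(&ctx, CHECKSUM_TYPE_CRC32C);
copy_file("backup/base/1/1259", "restored/base/1/1259", &ctx, false);
payload_len = pg_checksum_final(&ctx, payload);
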
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex characters, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
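
As a usage sketch (directory name hypothetical, and assuming load_manifest.h
plus the usual frontend setup), a caller loads one manifest and then probes
the hash table via the lookup function that simplehash generates from the
definitions above:

manifest_data *m = load_backup_manifest("/backups/full");

if (m != NULL)
{
	manifest_file *f = manifest_files_lookup(m->files, "backup_label");

	if (f != NULL)
		pg_log_debug("backup_label is %zu bytes", f->size);
}
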
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..bea0db405e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,29 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..6c7fd3290e
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1268 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ bool progress;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"output", required_argument, NULL, 'o'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"progress", no_argument, NULL, 'P'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'P':
+ opt.progress = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ fsync_pgdata(opt.output, version * 10000);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
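+ *
+ * For example (paths hypothetical), -T /srv/ts_old=/srv/ts_new relocates the
+ * tablespace at /srv/ts_old to /srv/ts_new; a literal "=" in either path can
+ * be escaped as "\=".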
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -P, --progress show progress information\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ Oid oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
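+ *
+ * For example, if three backups are given on the command line, the first two
+ * (the full backup and the first incremental) are the prior backups; then
+ * n_prior_backups is 2 and prior_backup_dirs holds those two paths.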
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ /* Copy the payload so it can be freed uniformly below. */
+ checksum_payload = pg_malloc(checksum_length);
+ memcpy(checksum_payload, mfile->checksum_payload,
+ checksum_length);
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those at or beyond that threshold are left
+ * as zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must still be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
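+ /*
+ * Worked example (illustrative): if truncation_block_length is 3 and
+ * the newest incremental file contains blocks 1 and 5, then
+ * block_length is 6; block 1 reduces num_missing_blocks from 3 to 2;
+ * block 5 is sourced from this file without affecting the count;
+ * blocks 0 and 2 must still be found in an older backup; and blocks 3
+ * and 4, which the truncation tells us no longer have valid contents,
+ * will be zero-filled.
+ */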
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source, except if dry-run. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
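
Aside for reviewers: the on-disk header that make_incremental_rfile()
parses above is just the magic number, the block count, the truncation
block length, and then the block-number array, all in native byte order,
with the block data following. A throwaway dumper like this sketch - not
part of the patch, and assuming 4-byte unsigned ints and BlockNumbers -
is handy for poking at the INCREMENTAL.* files a backup produces:

#include <stdint.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	uint32_t	hdr[3];		/* magic, num_blocks, truncation_block_length */
	uint32_t	blkno;
	FILE	   *f;

	if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
		return 1;
	if (fread(hdr, sizeof(uint32_t), 3, f) != 3)
		return 1;
	printf("magic 0x%x, %u blocks, truncation_block_length %u\n",
		   (unsigned) hdr[0], (unsigned) hdr[1], (unsigned) hdr[2]);
	while (hdr[1]-- > 0 && fread(&blkno, sizeof(blkno), 1, f) == 1)
		printf("block %u\n", (unsigned) blkno);
	return 0;
}
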
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
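
Aside: each file entry emitted by the code above ends up looking roughly
like this, with invented values for illustration:

{ "Path": "base/1/1259", "Size": 8192, "Last-Modified": "2023-06-14 12:00:00 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "0aba1234" }
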
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
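
Aside: the writer API in write_manifest.c/.h is driven roughly like this -
an illustrative sketch, not part of the patch, and I'm assuming here that
the parsed manifest from load_manifest.h exposes its WAL ranges as
first_wal_range:

	manifest_writer *mwriter = create_manifest_writer(output_directory);

	/* Once per file written into the output directory. */
	add_file_to_manifest(mwriter, "base/1/1259", 8192, time(NULL),
						 CHECKSUM_TYPE_NONE, 0, NULL);

	/* Emits WAL-Ranges and Manifest-Checksum, then closes the file. */
	finalize_manifest(mwriter, latest_manifest->first_wal_range);
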
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e7ef2b8bd0..f35302e994 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -488,6 +489,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1029,6 +1031,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
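+/* 40 = 8 hex digits of timeline ID plus 16 each for the start and end LSNs. */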
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index e4cd26762b..ef38cc2f03 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS_COMMON = \
archive.o \
base64.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
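
Aside: here is roughly how I expect callers to drive the in-memory API
that follows - an illustrative sketch only; the RelFileLocator values and
the write callback are made up, and the real consumers are the
walsummarizer and the backup code:

	BlockRefTable *brtab = CreateEmptyBlockRefTable();
	RelFileLocator rlocator = {1663, 5, 16384};	/* invented OIDs */

	/*
	 * A truncation to 100 blocks: sets the limit block and forgets any
	 * modified blocks at or beyond it.
	 */
	BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 100);

	/* Called for each modified block as the WAL is read. */
	BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 42);

	/* Serialize through a caller-supplied I/O callback. */
	WriteBlockRefTable(brtab, my_write_callback, my_write_callback_arg);
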
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
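+
+/*
+ * Worked example (illustrative): block 70000 belongs to chunk
+ * 70000 / BLOCKS_PER_CHUNK = 1 and is represented there by the 2-byte
+ * offset 70000 % BLOCKS_PER_CHUNK = 4464. An array chunk holding three
+ * offsets takes 6 bytes of chunk data; a bitmap chunk always takes
+ * MAX_ENTRIES_PER_CHUNK * sizeof(uint16) = 8192 bytes, one bit per block.
+ */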
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status; /* used internally by simplehash */
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
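
Continuing the sketch from above, retrieving what was recorded might look
like this (again illustrative only):

    BlockNumber limit_block;
    BlockNumber blocks[16];
    BlockRefTableEntry *entry;

    entry = BlockRefTableGetEntry(brtab, &rlocator, MAIN_FORKNUM,
                                  &limit_block);
    if (entry != NULL)
    {
        /* Fetch tracked modified blocks below the limit block. */
        int nresults = BlockRefTableEntryGetBlocks(entry, 0, limit_block,
                                                   blocks, 16);

        /* blocks[0 .. nresults - 1] now hold modified block numbers. */
    }
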
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
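
The io_callback_fn abstraction is what lets this code run in both backend
and frontend. As a sketch of a frontend write callback over stdio
(write_to_stream and the use of pg_fatal are my illustration, not part of
the patch), remembering that the API contract says a write callback must
not return after a failure:

    static int
    write_to_stream(void *callback_arg, void *data, int length)
    {
        FILE       *stream = (FILE *) callback_arg;

        /* Treat short writes as fatal so that we never return one. */
        if (fwrite(data, 1, length, stream) != (size_t) length)
            pg_fatal("could not write block reference table: %m");
        return length;
    }

The whole table can then be serialized with something like
WriteBlockRefTable(brtab, write_to_stream, stream).
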
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
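
Putting the reader pieces together, a consumer follows the protocol above:
one call to NextRelation, then GetBlocks until it returns 0. In this
sketch, read_from_stream, report_fatal_error, stream, and filename stand in
for caller-supplied callbacks and state:

    RelFileLocator rlocator;
    ForkNumber  forknum;
    BlockNumber limit_block;
    BlockNumber blocks[256];
    BlockRefTableReader *reader;

    reader = CreateBlockRefTableReader(read_from_stream, stream, filename,
                                       report_fatal_error, NULL);
    while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
                                           &limit_block))
    {
        unsigned    nblocks;

        /* Drain this relation fork before moving on to the next one. */
        while ((nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
                                                       256)) > 0)
        {
            /* ... process blocks[0 .. nblocks - 1] ... */
        }
    }
    DestroyBlockRefTableReader(reader);
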
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
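
Here is how a standalone entry and the incremental writer fit together, as
a sketch that writes a single entry (write_to_stream is the hypothetical
callback from earlier):

    BlockRefTableWriter *writer;
    BlockRefTableEntry *entry;

    writer = CreateBlockRefTableWriter(write_to_stream, stream);

    /* Entries must be supplied in sorted order; here there is only one. */
    entry = CreateBlockRefTableEntry(rlocator, MAIN_FORKNUM);
    BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 7);
    BlockRefTableEntrySetLimitBlock(entry, 128);    /* truncated to 128 */
    BlockRefTableWriteEntry(writer, entry);
    BlockRefTableFreeEntry(entry);

    DestroyBlockRefTableWriter(writer);
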
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer to be written to the underlying file,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index cc6671edca..4ee0ea1f9d 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
common_sources = files(
'archive.c',
'base64.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 48ca852381..fed5d790cc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -206,6 +206,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6996073989..c21573efb6 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12043,4 +12043,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
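
To make the limit-block rules in the header comment concrete, this is how I
would expect a summarizer-like caller to map events onto the API (a sketch,
not code from the patch):

    /* Relation fork created or dropped: everything counts as modified. */
    BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 0);

    /* Relation fork truncated to 128 blocks. */
    BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 128);

    /* A WAL record modified block 42. */
    BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 42);
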
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..898adccb25 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -443,6 +444,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -455,6 +457,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 518d3b0a1f..3f99e2eddb 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -47,6 +47,7 @@ typedef enum
WAIT_EVENT_SYSLOGGER_MAIN,
WAIT_EVENT_WAL_RECEIVER_MAIN,
WAIT_EVENT_WAL_SENDER_MAIN,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL,
WAIT_EVENT_WAL_WRITER_MAIN
} WaitEventActivity;
@@ -131,6 +132,7 @@ typedef enum
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_WAL_RECEIVER_EXIT,
WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+ WAIT_EVENT_WAL_SUMMARY_READY,
WAIT_EVENT_XACT_GROUP_UPDATE
} WaitEventIPC;
@@ -150,7 +152,8 @@ typedef enum
WAIT_EVENT_REGISTER_SYNC_REQUEST,
WAIT_EVENT_SPIN_DELAY,
WAIT_EVENT_VACUUM_DELAY,
- WAIT_EVENT_VACUUM_TRUNCATE
+ WAIT_EVENT_VACUUM_TRUNCATE,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR
} WaitEventTimeout;
/* ----------
@@ -232,6 +235,8 @@ typedef enum
WAIT_EVENT_WAL_INIT_SYNC,
WAIT_EVENT_WAL_INIT_WRITE,
WAIT_EVENT_WAL_READ,
+ WAIT_EVENT_WAL_SUMMARY_READ,
+ WAIT_EVENT_WAL_SUMMARY_WRITE,
WAIT_EVENT_WAL_SYNC,
WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
WAIT_EVENT_WAL_WRITE
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c72ba0944..353db33a9f 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 33e50ad933..6ba5eca700 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 480e6d6caa..a91437dfa7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 260854747b..48a10a5d39 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3979,3 +3979,27 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+FileChunkContext
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v1-0001-In-basebackup.c-refactor-to-create-verify_page_ch.patch (application/octet-stream)
From bf013c066e9f9af9f231e906aaf5a1581a9e4d04 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:40:07 -0400
Subject: [PATCH v1 1/8] In basebackup.c, refactor to create
verify_page_checksum.
If checksum verification fails for a particular page, we reread the
page and try one more time. The code that does this is somewhat complex
and difficult to follow. Move some of the logic into a new function
and rearrange the code a bit to try to make it clearer. This way,
we don't need the block_retry Boolean, a couple of other variables
move from sendFile() into the new function, and some code is now less
deeply indented.
---
src/backend/backup/basebackup.c | 188 ++++++++++++++++++--------------
1 file changed, 104 insertions(+), 84 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..0daf8257bc 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -83,6 +83,9 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
+static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
+ BlockNumber blkno,
+ uint16 *expected_checksum);
static void sendFileWithContent(bbsink *sink, const char *filename,
const char *content,
backup_manifest_info *manifest);
@@ -1485,14 +1488,11 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
{
int fd;
BlockNumber blkno = 0;
- bool block_retry = false;
- uint16 checksum;
int checksum_failures = 0;
off_t cnt;
int i;
pgoff_t len = 0;
char *page;
- PageHeader phdr;
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
@@ -1582,94 +1582,78 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
{
for (i = 0; i < cnt / BLCKSZ; i++)
{
+ int reread_cnt;
+ uint16 expected_checksum;
+
page = sink->bbs_buffer + BLCKSZ * i;
+ /* If the page is OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr,
+ blkno + i + segmentno * RELSEG_SIZE,
+ &expected_checksum))
+ continue;
+
/*
- * Only check pages which have not been modified since the
- * start of the base backup. Otherwise, they might have been
- * written only halfway and the checksum would not be valid.
- * However, replaying WAL would reinstate the correct page in
- * this case. We also skip completely new pages, since they
- * don't have a checksum yet.
+ * Retry the block on the first failure. It's possible that
+ * we read the first 4K page of the block just before postgres
+ * updated the entire block so it ends up looking torn to us.
+ * If, before we retry the read, the concurrent write of the
+ * block finishes, the page LSN will be updated and we'll
+ * realize that we should ignore this block.
+ *
+ * There's no guarantee that this will actually happen,
+ * though: the torn write could take an arbitrarily long time
+ * to complete. Retrying multiple times wouldn't fix this
+ * problem, either, though it would reduce the chances of it
+ * happening in practice. The only real fix here seems to be
+ * to have some kind of interlock that allows us to wait until
+ * we can be certain that no write to the block is in
+ * progress. Since we don't have any such thing right now, we
+ * just do this and hope for the best.
*/
- if (!PageIsNew(page) && PageGetLSN(page) < sink->bbs_state->startptr)
+ reread_cnt =
+ basebackup_read_file(fd,
+ sink->bbs_buffer + BLCKSZ * i,
+ BLCKSZ, len + BLCKSZ * i,
+ readfilename,
+ false);
+ if (reread_cnt == 0)
{
- checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
- phdr = (PageHeader) page;
- if (phdr->pd_checksum != checksum)
- {
- /*
- * Retry the block on the first failure. It's
- * possible that we read the first 4K page of the
- * block just before postgres updated the entire block
- * so it ends up looking torn to us. If, before we
- * retry the read, the concurrent write of the block
- * finishes, the page LSN will be updated and we'll
- * realize that we should ignore this block.
- *
- * There's no guarantee that this will actually
- * happen, though: the torn write could take an
- * arbitrarily long time to complete. Retrying
- * multiple times wouldn't fix this problem, either,
- * though it would reduce the chances of it happening
- * in practice. The only real fix here seems to be to
- * have some kind of interlock that allows us to wait
- * until we can be certain that no write to the block
- * is in progress. Since we don't have any such thing
- * right now, we just do this and hope for the best.
- */
- if (block_retry == false)
- {
- int reread_cnt;
-
- /* Reread the failed block */
- reread_cnt =
- basebackup_read_file(fd,
- sink->bbs_buffer + BLCKSZ * i,
- BLCKSZ, len + BLCKSZ * i,
- readfilename,
- false);
- if (reread_cnt == 0)
- {
- /*
- * If we hit end-of-file, a concurrent
- * truncation must have occurred, so break out
- * of this loop just as if the initial fread()
- * returned 0. We'll drop through to the same
- * code that handles that case. (We must fix
- * up cnt first, though.)
- */
- cnt = BLCKSZ * i;
- break;
- }
-
- /* Set flag so we know a retry was attempted */
- block_retry = true;
-
- /* Reset loop to validate the block again */
- i--;
- continue;
- }
-
- checksum_failures++;
-
- if (checksum_failures <= 5)
- ereport(WARNING,
- (errmsg("checksum verification failed in "
- "file \"%s\", block %u: calculated "
- "%X but expected %X",
- readfilename, blkno, checksum,
- phdr->pd_checksum)));
- if (checksum_failures == 5)
- ereport(WARNING,
- (errmsg("further checksum verification "
- "failures in file \"%s\" will not "
- "be reported", readfilename)));
- }
+ /*
+ * If we hit end-of-file, a concurrent truncation must
+ * have occurred, so break out of this loop just as if the
+ * initial fread() returned 0. We'll drop through to the
+ * same code that handles that case. (We must fix up cnt
+ * first, though.)
+ */
+ cnt = BLCKSZ * i;
+ break;
}
- block_retry = false;
- blkno++;
+
+ /* If the page now looks OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr,
+ blkno + i + segmentno * RELSEG_SIZE,
+ &expected_checksum))
+ continue;
+
+ /* Handle checksum failure. */
+ checksum_failures++;
+ if (checksum_failures <= 5)
+ ereport(WARNING,
+ (errmsg("checksum verification failed in "
+ "file \"%s\", block %u: calculated "
+ "%X but expected %X",
+ readfilename, blkno + i, expected_checksum,
+ ((PageHeader) page)->pd_checksum)));
+ if (checksum_failures == 5)
+ ereport(WARNING,
+ (errmsg("further checksum verification "
+ "failures in file \"%s\" will not "
+ "be reported", readfilename)));
}
+
+ /* Update block number for next pass through the outer loop. */
+ blkno += i;
}
/*
@@ -1734,6 +1718,42 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
return true;
}
+/*
+ * Try to verify the checksum for the provided page, if it seems appropriate
+ * to do so.
+ *
+ * Returns true if verification succeeds or if we decide not to check it,
+ * and false if verification fails. When returning false, it also sets
+ * *expected_checksum to the computed value.
+ */
+static bool
+verify_page_checksum(Page page, XLogRecPtr start_lsn, BlockNumber blkno,
+ uint16 *expected_checksum)
+{
+ PageHeader phdr;
+ uint16 checksum;
+
+ /*
+ * Only check pages which have not been modified since the start of the
+ * base backup. Otherwise, they might have been written only halfway and
+ * the checksum would not be valid. However, replaying WAL would
+ * reinstate the correct page in this case. We also skip completely new
+ * pages, since they don't have a checksum yet.
+ */
+ if (PageIsNew(page) || PageGetLSN(page) >= start_lsn)
+ return true;
+
+ /* Perform the actual checksum calculation. */
+ checksum = pg_checksum_page(page, blkno);
+
+ /* See whether it matches the value from the page. */
+ phdr = (PageHeader) page;
+ if (phdr->pd_checksum == checksum)
+ return true;
+ *expected_checksum = checksum;
+ return false;
+}
+
static int64
_tarWriteHeader(bbsink *sink, const char *filename, const char *linktarget,
struct stat *statbuf, bool sizeonly)
--
2.37.1 (Apple Git-137.1)
Attachment: v1-0005-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From f163b7ce56cbf6fa2e716f4d68d101a192297bf3 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:19 -0400
Subject: [PATCH v1 5/8] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 173 +++++++++++---------------------
1 file changed, 56 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 24c038dfba..64ab54fe06 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,41 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork
+ * with the same RelFileNumber. If so, the file can be
+ * excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1413,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1436,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1450,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1458,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1483,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1505,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
Attachment: v1-0003-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch (application/octet-stream)
From cfac4b6329e5854010160896ba8e7a1bc34f3b31 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:00 -0400
Subject: [PATCH v1 3/8] Change struct tablespaceinfo's oid member from 'char
*' to 'Oid'
This shouldn't change behavior except in the unusual case where
there are files in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, or the value is actually zero, or because
the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
---
src/backend/access/transam/xlog.c | 19 ++++++++++--
src/backend/access/transam/xlogrecovery.c | 12 ++++++--
src/backend/backup/backup_manifest.c | 6 ++--
src/backend/backup/basebackup.c | 35 ++++++++++++-----------
src/backend/backup/basebackup_copy.c | 2 +-
src/include/backup/backup_manifest.h | 2 +-
src/include/backup/basebackup.h | 2 +-
7 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b2430f617c..664d4ba598 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8456,9 +8456,22 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
char *relpath = NULL;
char *s;
PGFileType de_type;
+ char *badp;
+ Oid tsoid;
- /* Skip anything that doesn't look like a tablespace */
- if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+ /*
+ * Try to parse the directory name as an unsigned integer.
+ *
+ * Tablespace directories should be positive integers that can
+ * be represented in 32 bits, with no leading zeroes or trailing
+ * garbage. If we come across a name that doesn't meet those
+ * criteria, skip it.
+ */
+ if (de->d_name[0] < '1' || de->d_name[0] > '9')
+ continue;
+ errno = 0;
+ tsoid = strtoul(de->d_name, &badp, 10);
+ if (*badp != '\0' || errno == EINVAL || errno == ERANGE)
continue;
snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
@@ -8533,7 +8546,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
}
ti = palloc(sizeof(tablespaceinfo));
- ti->oid = pstrdup(de->d_name);
+ ti->oid = tsoid;
ti->path = pstrdup(linkpath);
ti->rpath = relpath;
ti->size = -1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..4ff4430006 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -678,7 +678,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
tablespaceinfo *ti = lfirst(lc);
char *linkloc;
- linkloc = psprintf("pg_tblspc/%s", ti->oid);
+ linkloc = psprintf("pg_tblspc/%u", ti->oid);
/*
* Remove the existing symlink if any and Create the symlink
@@ -692,7 +692,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not create symbolic link \"%s\": %m",
linkloc)));
- pfree(ti->oid);
pfree(ti->path);
pfree(ti);
}
@@ -1341,6 +1340,8 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
+ char *endp;
+
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1360,7 +1361,12 @@ read_tablespace_map(List **tablespaces)
str[n++] = '\0';
ti = palloc0(sizeof(tablespaceinfo));
- ti->oid = pstrdup(str);
+ errno = 0;
+ ti->oid = strtoul(str, &endp, 10);
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
ti->path = pstrdup(str + n);
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index cee6216524..aeed362a9a 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -97,7 +97,7 @@ FreeBackupManifest(backup_manifest_info *manifest)
* Add an entry to the backup manifest for a file.
*/
void
-AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
+AddFileToBackupManifest(backup_manifest_info *manifest, Oid spcoid,
const char *pathname, size_t size, pg_time_t mtime,
pg_checksum_context *checksum_ctx)
{
@@ -114,9 +114,9 @@ AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
* pathname relative to the data directory (ignoring the intermediate
* symlink traversal).
*/
- if (spcoid != NULL)
+ if (OidIsValid(spcoid))
{
- snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%s/%s", spcoid,
+ snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%u/%s", spcoid,
pathname);
pathname = pathbuf;
}
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index f46f930329..cc3d2e0c41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -75,14 +75,15 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
-static int64 sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
struct backup_manifest_info *manifest);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, const char *spcoid);
+ backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid);
+ struct stat *statbuf, bool missing_ok,
+ Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
@@ -305,7 +306,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, NULL);
+ true, NULL, InvalidOid);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
NULL);
@@ -346,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, NULL);
+ sendtblspclinks, &manifest, InvalidOid);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -355,11 +356,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ false, InvalidOid, InvalidOid, &manifest);
}
else
{
- char *archive_name = psprintf("%s.tar", ti->oid);
+ char *archive_name = psprintf("%u.tar", ti->oid);
bbsink_begin_archive(sink, archive_name);
@@ -623,8 +624,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(sink, pathbuf, pathbuf, &statbuf, false, InvalidOid,
- &manifest, NULL);
+ sendFile(sink, pathbuf, pathbuf, &statbuf, false,
+ InvalidOid, InvalidOid, &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1087,7 +1088,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
_tarWritePadding(sink, len);
- AddFileToBackupManifest(manifest, NULL, filename, len,
+ AddFileToBackupManifest(manifest, InvalidOid, filename, len,
(pg_time_t) statbuf.st_mtime, &checksum_ctx);
}
@@ -1099,7 +1100,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
* Only used to send auxiliary tablespaces, not PGDATA.
*/
static int64
-sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
backup_manifest_info *manifest)
{
int64 size;
@@ -1154,7 +1155,7 @@ sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- const char *spcoid)
+ Oid spcoid)
{
DIR *dir;
struct dirent *de;
@@ -1419,8 +1420,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid,
- manifest, spcoid);
+ true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
+ manifest);
if (sent || sizeonly)
{
@@ -1489,8 +1490,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
*/
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid)
+ struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest)
{
int fd;
BlockNumber blkno = 0;
diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index 1db80cde1b..2b42fd257e 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -407,7 +407,7 @@ SendTablespaceList(List *tablespaces)
}
else
{
- values[0] = ObjectIdGetDatum(strtoul(ti->oid, NULL, 10));
+ values[0] = ObjectIdGetDatum(ti->oid);
values[1] = CStringGetTextDatum(ti->path);
}
if (ti->size >= 0)
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index d41b439980..5a481dbcf5 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -39,7 +39,7 @@ extern void InitializeBackupManifest(backup_manifest_info *manifest,
backup_manifest_option want_manifest,
pg_checksum_type manifest_checksum_type);
extern void AddFileToBackupManifest(backup_manifest_info *manifest,
- const char *spcoid,
+ Oid spcoid,
const char *pathname, size_t size,
pg_time_t mtime,
pg_checksum_context *checksum_ctx);
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 3e68abc2bb..1432d9c206 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -27,7 +27,7 @@
*/
typedef struct
{
- char *oid; /* tablespace's OID, as a decimal string */
+ Oid oid; /* tablespace's OID */
char *path; /* full path to tablespace's directory */
char *rpath; /* relative path if it's within PGDATA, else
* NULL */
--
2.37.1 (Apple Git-137.1)
Attachment: v1-0002-In-basebackup.c-refactor-to-create-read_file_data.patch (application/octet-stream)
From 6889b92e6d235c1554ed903a8b6edb5fcd646a03 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:40:14 -0400
Subject: [PATCH v1 2/8] In basebackup.c, refactor to create
read_file_data_into_buffer.
This further reduces the length and complexity of sendFile(),
hopefully making it easier to understand and modify. In addition
to moving some logic into a new function, I took this opportunity
to make a few slight adjustments to sendFile() itself, including
renaming the 'len' variable to 'bytes_done', since we use it to represent
the number of bytes we've already handled so far, not the total
length of the file.
---
src/backend/backup/basebackup.c | 231 ++++++++++++++++++--------------
1 file changed, 133 insertions(+), 98 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 0daf8257bc..f46f930329 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -83,6 +83,12 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
+static off_t read_file_data_into_buffer(bbsink *sink,
+ const char *readfilename, int fd,
+ off_t offset, size_t length,
+ BlockNumber blkno,
+ bool verify_checksum,
+ int *checksum_failures);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -1490,9 +1496,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
BlockNumber blkno = 0;
int checksum_failures = 0;
off_t cnt;
- int i;
- pgoff_t len = 0;
- char *page;
+ pgoff_t bytes_done = 0;
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
@@ -1514,6 +1518,12 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
_tarWriteHeader(sink, tarfilename, NULL, statbuf, false);
+ /*
+ * Checksums are verified in multiples of BLCKSZ, so the buffer length
+ * should be a multiple of the block size as well.
+ */
+ Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
+
if (!noverify_checksums && DataChecksumsEnabled())
{
char *filename;
@@ -1551,23 +1561,21 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (len < statbuf->st_size)
+ while (bytes_done < statbuf->st_size)
{
- size_t remaining = statbuf->st_size - len;
+ size_t remaining = statbuf->st_size - bytes_done;
/* Try to read some more data. */
- cnt = basebackup_read_file(fd, sink->bbs_buffer,
- Min(sink->bbs_buffer_length, remaining),
- len, readfilename, true);
+ cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
+ remaining,
+ blkno + segmentno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
/*
- * The checksums are verified at block level, so we iterate over the
- * buffer in chunks of BLCKSZ, after making sure that
- * TAR_SEND_SIZE/buf is divisible by BLCKSZ and we read a multiple of
- * BLCKSZ bytes.
+ * If the amount of data we were able to read was not a multiple of
+ * BLCKSZ, we cannot verify checksums, which are block-level.
*/
- Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
-
if (verify_checksum && (cnt % BLCKSZ != 0))
{
ereport(WARNING,
@@ -1578,84 +1586,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
verify_checksum = false;
}
- if (verify_checksum)
- {
- for (i = 0; i < cnt / BLCKSZ; i++)
- {
- int reread_cnt;
- uint16 expected_checksum;
-
- page = sink->bbs_buffer + BLCKSZ * i;
-
- /* If the page is OK, go on to the next one. */
- if (verify_page_checksum(page, sink->bbs_state->startptr,
- blkno + i + segmentno * RELSEG_SIZE,
- &expected_checksum))
- continue;
-
- /*
- * Retry the block on the first failure. It's possible that
- * we read the first 4K page of the block just before postgres
- * updated the entire block so it ends up looking torn to us.
- * If, before we retry the read, the concurrent write of the
- * block finishes, the page LSN will be updated and we'll
- * realize that we should ignore this block.
- *
- * There's no guarantee that this will actually happen,
- * though: the torn write could take an arbitrarily long time
- * to complete. Retrying multiple times wouldn't fix this
- * problem, either, though it would reduce the chances of it
- * happening in practice. The only real fix here seems to be
- * to have some kind of interlock that allows us to wait until
- * we can be certain that no write to the block is in
- * progress. Since we don't have any such thing right now, we
- * just do this and hope for the best.
- */
- reread_cnt =
- basebackup_read_file(fd,
- sink->bbs_buffer + BLCKSZ * i,
- BLCKSZ, len + BLCKSZ * i,
- readfilename,
- false);
- if (reread_cnt == 0)
- {
- /*
- * If we hit end-of-file, a concurrent truncation must
- * have occurred, so break out of this loop just as if the
- * initial fread() returned 0. We'll drop through to the
- * same code that handles that case. (We must fix up cnt
- * first, though.)
- */
- cnt = BLCKSZ * i;
- break;
- }
-
- /* If the page now looks OK, go on to the next one. */
- if (verify_page_checksum(page, sink->bbs_state->startptr,
- blkno + i + segmentno * RELSEG_SIZE,
- &expected_checksum))
- continue;
-
- /* Handle checksum failure. */
- checksum_failures++;
- if (checksum_failures <= 5)
- ereport(WARNING,
- (errmsg("checksum verification failed in "
- "file \"%s\", block %u: calculated "
- "%X but expected %X",
- readfilename, blkno + i, expected_checksum,
- ((PageHeader) page)->pd_checksum)));
- if (checksum_failures == 5)
- ereport(WARNING,
- (errmsg("further checksum verification "
- "failures in file \"%s\" will not "
- "be reported", readfilename)));
- }
-
- /* Update block number for next pass through the outer loop. */
- blkno += i;
- }
-
/*
* If we hit end-of-file, a concurrent truncation must have occurred.
* That's not an error condition, because WAL replay will fix things
@@ -1664,6 +1594,10 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
if (cnt == 0)
break;
+ /* Update block number and # of bytes done for next loop iteration. */
+ blkno += cnt / BLCKSZ;
+ bytes_done += cnt;
+
/* Archive the data we just read. */
bbsink_archive_contents(sink, cnt);
@@ -1671,14 +1605,12 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
if (pg_checksum_update(&checksum_ctx,
(uint8 *) sink->bbs_buffer, cnt) < 0)
elog(ERROR, "could not update checksum of base backup");
-
- len += cnt;
}
/* If the file was truncated while we were sending it, pad it with zeros */
- while (len < statbuf->st_size)
+ while (bytes_done < statbuf->st_size)
{
- size_t remaining = statbuf->st_size - len;
+ size_t remaining = statbuf->st_size - bytes_done;
size_t nbytes = Min(sink->bbs_buffer_length, remaining);
MemSet(sink->bbs_buffer, 0, nbytes);
@@ -1687,7 +1619,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
nbytes) < 0)
elog(ERROR, "could not update checksum of base backup");
bbsink_archive_contents(sink, nbytes);
- len += nbytes;
+ bytes_done += nbytes;
}
/*
@@ -1695,7 +1627,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
* of data is probably not worth throttling, and is not checksummed
* because it's not actually part of the file.)
*/
- _tarWritePadding(sink, len);
+ _tarWritePadding(sink, bytes_done);
CloseTransientFile(fd);
@@ -1718,6 +1650,109 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
return true;
}
+/*
+ * Read some more data from the file into the bbsink's buffer, verifying
+ * checksums as required.
+ *
+ * 'offset' is the file offset from which we should begin to read, and
+ * 'length' is the amount of data that should be read. The actual amount
+ * of data read will be less than the requested amount if the bbsink's
+ * buffer isn't big enough to hold it all, or if the underlying file has
+ * been truncated. The return value is the number of bytes actually read.
+ *
+ * 'blkno' is the block number of the first page in the bbsink's buffer
+ * relative to the start of the relation.
+ *
+ * 'verify_checksum' indicates whether we should try to verify checksums
+ * for the blocks we read. If we do this, we'll update *checksum_failures
+ * and issue warnings as appropriate.
+ */
+static off_t
+read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
+ off_t offset, size_t length, BlockNumber blkno,
+ bool verify_checksum, int *checksum_failures)
+{
+ off_t cnt;
+ int i;
+ char *page;
+
+ /* Try to read some more data. */
+ cnt = basebackup_read_file(fd, sink->bbs_buffer,
+ Min(sink->bbs_buffer_length, length),
+ offset, readfilename, true);
+
+ /* Can't verify checksums if read length is not a multiple of BLCKSZ. */
+ if (!verify_checksum || (cnt % BLCKSZ) != 0)
+ return cnt;
+
+ /* Verify checksum for each block. */
+ for (i = 0; i < cnt / BLCKSZ; i++)
+ {
+ int reread_cnt;
+ uint16 expected_checksum;
+
+ page = sink->bbs_buffer + BLCKSZ * i;
+
+ /* If the page is OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr, blkno + i,
+ &expected_checksum))
+ continue;
+
+ /*
+ * Retry the block on the first failure. It's possible that we read
+ * the first 4K page of the block just before postgres updated the
+ * entire block so it ends up looking torn to us. If, before we retry
+ * the read, the concurrent write of the block finishes, the page LSN
+ * will be updated and we'll realize that we should ignore this block.
+ *
+ * There's no guarantee that this will actually happen, though: the
+ * torn write could take an arbitrarily long time to complete.
+ * Retrying multiple times wouldn't fix this problem, either, though
+ * it would reduce the chances of it happening in practice. The only
+ * real fix here seems to be to have some kind of interlock that
+ * allows us to wait until we can be certain that no write to the
+ * block is in progress. Since we don't have any such thing right now,
+ * we just do this and hope for the best.
+ */
+ reread_cnt =
+ basebackup_read_file(fd, sink->bbs_buffer + BLCKSZ * i,
+ BLCKSZ, offset + BLCKSZ * i,
+ readfilename, false);
+ if (reread_cnt == 0)
+ {
+ /*
+ * If we hit end-of-file, a concurrent truncation must have
+ * occurred, so reduce cnt to reflect only the blocks already
+ * processed and break out of this loop.
+ */
+ cnt = BLCKSZ * i;
+ break;
+ }
+
+ /* If the page now looks OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr, blkno + i,
+ &expected_checksum))
+ continue;
+
+ /* Handle checksum failure. */
+ (*checksum_failures)++;
+ if (*checksum_failures <= 5)
+ ereport(WARNING,
+ (errmsg("checksum verification failed in "
+ "file \"%s\", block %u: calculated "
+ "%X but expected %X",
+ readfilename, blkno + i, expected_checksum,
+ ((PageHeader) page)->pd_checksum)));
+ if (*checksum_failures == 5)
+ ereport(WARNING,
+ (errmsg("further checksum verification "
+ "failures in file \"%s\" will not "
+ "be reported", readfilename)));
+ }
+
+ return cnt;
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
--
2.37.1 (Apple Git-137.1)
Attachment: v1-0004-Refactor-parse_filename_for_nontemp_relation-to-p.patch (application/octet-stream)
From 9b20eaf01f76b1bc325ce0968b294828cde09ea5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:15 -0400
Subject: [PATCH v1 4/8] Refactor parse_filename_for_nontemp_relation to parse
more.
Instead of returning the number of characters in the RelFileNumber,
return the RelFileNumber itself. Continue to return the fork number,
as before, and additionally return the segment number.
parse_filename_for_nontemp_relation now rejects a RelFileNumber or
segment number that begins with a leading zero. Before, we accepted
such cases as relation filenames, but if we continued to do so after
this change, the function might return the same values for two
different files (e.g. 1234.5 and 001234.5 or 1234.005) which could be
annoying for callers. Since we don't actually ever generate filenames
with leading zeroes in the names, any such files that we find must
have been created by something other than PostgreSQL, and it is
therefore reasonable to treat them as non-relation files.
Along the way, change unlogged_relation_entry to store a RelFileNumber
rather than an OID. This update should have been made in
851f4cc75cdd8c831f1baa9a7abf8c8248b65890, but it was overlooked.
It could be done separately from the rest of this commit, but that
would be more involved, whereas this way it's a 1-line change.
---
src/backend/backup/basebackup.c | 15 ++--
src/backend/storage/file/reinit.c | 137 ++++++++++++++++++------------
src/include/storage/reinit.h | 5 +-
3 files changed, 93 insertions(+), 64 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index cc3d2e0c41..24c038dfba 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1198,9 +1198,9 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
{
int excludeIdx;
bool excludeFound;
- ForkNumber relForkNum; /* Type of fork if file is a relation */
- int relnumchars; /* Chars in filename that are the
- * relnumber */
+ RelFileNumber relNumber;
+ ForkNumber relForkNum;
+ unsigned segno;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1250,23 +1250,20 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &relForkNum, &segno))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
{
char initForkFile[MAXPGPATH];
- char relNumber[OIDCHARS + 1];
/*
* If any other type of fork, check if there is an init fork
* with the same RelFileNumber. If so, the file can be
* excluded.
*/
- memcpy(relNumber, de->d_name, relnumchars);
- relNumber[relnumchars] = '\0';
- snprintf(initForkFile, sizeof(initForkFile), "%s/%s_init",
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relNumber);
if (lstat(initForkFile, &statbuf) == 0)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..31d6e01106 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
- Oid reloid; /* hash key */
+ RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
/*
@@ -195,12 +195,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -208,10 +209,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the RelFileNumber into the hash table, if it isn't already.
*/
- ent.reloid = atooid(de->d_name);
(void) hash_search(hash, &ent, HASH_ENTER, NULL);
}
@@ -235,12 +234,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* We never remove the init fork. */
@@ -251,7 +251,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
if (hash_search(hash, &ent, HASH_FIND, NULL))
{
snprintf(rm_path, sizeof(rm_path), "%s/%s",
@@ -285,14 +284,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ RelFileNumber relNumber;
+ unsigned segno;
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -304,11 +303,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspacedirname, de->d_name);
/* Construct destination pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(dstpath, sizeof(dstpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
/* OK, we're ready to perform the actual copy. */
elog(DEBUG2, "copying %s to %s", srcpath, dstpath);
@@ -327,14 +327,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
+ RelFileNumber relNumber;
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ unsigned segno;
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -342,11 +342,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/* Construct main fork pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(mainpath, sizeof(mainpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(mainpath, sizeof(mainpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
fsync_fname(mainpath, false);
}
@@ -371,52 +372,82 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* This function returns true if the file appears to be in the correct format
* for a non-temporary relation and false otherwise.
*
- * NB: If this function returns true, the caller is entitled to assume that
- * *relnumchars has been set to a value no more than OIDCHARS, and thus
- * that a buffer of OIDCHARS+1 characters is sufficient to hold the
- * RelFileNumber portion of the filename. This is critical to protect against
- * a possible buffer overrun.
+ * If it returns true, it sets *relnumber, *fork, and *segno to the values
+ * extracted from the filename. If it returns false, these values are set to
+ * InvalidRelFileNumber, InvalidForkNumber, and 0, respectively.
*/
bool
-parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
+ ForkNumber *fork, unsigned *segno)
{
- int pos;
+ unsigned long n,
+ s;
+ ForkNumber f;
+ char *endp;
- /* Look for a non-empty string of digits (that isn't too long). */
- for (pos = 0; isdigit((unsigned char) name[pos]); ++pos)
- ;
- if (pos == 0 || pos > OIDCHARS)
+ *relnumber = InvalidRelFileNumber;
+ *fork = InvalidForkNumber;
+ *segno = 0;
+
+ /*
+ * Relation filenames should begin with a digit that is not a zero. By
+ * rejecting cases involving leading zeroes, the caller can assume that
+ * there's only one possible string of characters that could have produced
+ * any given value for *relnumber.
+ *
+ * (To be clear, we don't expect files with names like 0017.3 to exist
+ * at all -- but if 0017.3 does exist, it's a non-relation file, not
+ * part of the main fork for relfilenode 17.)
+ */
+ if (name[0] < '1' || name[0] > '9')
+ return false;
+
+ /*
+ * Parse the leading digit string. If the value is out of range, we
+ * conclude that this isn't a relation file at all.
+ */
+ errno = 0;
+ n = strtoul(name, &endp, 10);
+ if (errno || name == endp || n <= 0 || n > PG_UINT32_MAX)
return false;
- *relnumchars = pos;
+ name = endp;
/* Check for a fork name. */
- if (name[pos] != '_')
- *fork = MAIN_FORKNUM;
+ if (*name != '_')
+ f = MAIN_FORKNUM;
else
{
int forkchar;
- forkchar = forkname_chars(&name[pos + 1], fork);
+ forkchar = forkname_chars(name + 1, &f);
if (forkchar <= 0)
return false;
- pos += forkchar + 1;
+ name += forkchar + 1;
}
/* Check for a segment number. */
- if (name[pos] == '.')
+ if (*name != '.')
+ s = 0;
+ else
{
- int segchar;
+ /* Reject leading zeroes, just like we do for RelFileNumber. */
+ if (name[1] < '1' || name[1] > '9')
+ return false;
- for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
- ;
- if (segchar <= 1)
+ errno = 0;
+ s = strtoul(name + 1, &endp, 10);
+ if (errno || name + 1 == endp || s <= 0 || s > PG_UINT32_MAX)
return false;
- pos += segchar;
+ name = endp;
}
/* Now we should be at the end. */
- if (name[pos] != '\0')
+ if (*name != '\0')
return false;
+
+ /* Set out parameters and return. */
+ *relnumber = (RelFileNumber) n;
+ *fork = f;
+ *segno = (unsigned) s;
return true;
}
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..f8eb7ce234 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,8 +20,9 @@
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
- int *relnumchars,
- ForkNumber *fork);
+ RelFileNumber *relnumber,
+ ForkNumber *fork,
+ unsigned *segno);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
--
2.37.1 (Apple Git-137.1)
Hi,
On 2023-06-14 14:46:48 -0400, Robert Haas wrote:
A few years ago, I sketched out a design for incremental backup, but
no patch for incremental backup ever got committed. Instead, the whole
thing evolved into a project to add backup manifests, which are nice,
but not as nice as incremental backup would be. So I've decided to
have another go at incremental backup itself. Attached are some WIP
patches. Let me summarize the design and some open questions and
problems with it that I've discovered. I welcome problem reports and
test results from others, as well.
Cool!
I originally had the idea of summarizing a certain number of MB of WAL
per WAL summary file, and so I added a GUC wal_summarize_mb for that
purpose. But then I realized that actually, you really want WAL
summary file boundaries to line up with possible redo points, because
when you do an incremental backup, you need a summary that stretches
from the redo point of the checkpoint written at the start of the
prior backup to the redo point of the checkpoint written at the start
of the current backup. The block modifications that happen in that
range of WAL records are the ones that need to be included in the
incremental.
I assume this is "solely" required for keeping the incremental backups as
small as possible, rather than being required for correctness?
Unfortunately, there's no indication in the WAL itself
that you've reached a redo point, but I wrote code that tries to
notice when we've reached the redo point stored in shared memory and
stops the summary there. But I eventually realized that's not good
enough either, because if summarization zooms past the redo point
before noticing the updated redo point in shared memory, then the
backup sat around waiting for the next summary file to be generated so
it had enough summaries to proceed with the backup, while the
summarizer was in no hurry to finish up the current file and just sat
there waiting for more WAL to be generated. Eventually the incremental
backup would just time out. I tried to fix that by making it so that
if somebody's waiting for a summary file to be generated, they can let
the summarizer know about that and it can write a summary file ending
at the LSN up to which it has read and then begin a new file from
there. That seems to fix the hangs, but now I've got three
overlapping, interconnected systems for deciding where to end the
current summary file, and maybe that's OK, but I have a feeling there
might be a better way.
Could we just recompute the WAL summary for the [redo, end of chunk] for the
relevant summary file?
Dilip had an interesting potential solution to this problem, which was
to always emit a special WAL record at the redo pointer. That is, when
we fix the redo pointer for the checkpoint record we're about to
write, also insert a WAL record there. That way, when the summarizer
reaches that sentinel record, it knows it should stop the summary just
before. I'm not sure whether this approach is viable, especially from
a performance and concurrency perspective, and I'm not sure whether
people here would like it, but it does seem like it would make things
a whole lot simpler for this patch set.
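To make that concrete, here is a rough sketch of what emitting such a
sentinel might look like at the point where the redo location is fixed;
XLOG_CHECKPOINT_REDO is an invented info code for rmgr XLOG, used purely
for illustration:

    XLogRecPtr  endptr;

    /*
     * Insert an empty sentinel record. Its start LSN becomes the redo
     * pointer of the checkpoint record we are about to write.
     */
    XLogBeginInsert();
    endptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);

    /* XLogInsert returns the record's end; its start is in ProcLastRecPtr. */
    checkPoint.redo = ProcLastRecPtr;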
FWIW, I like the idea of a special WAL record at that point, independent of
this feature. It wouldn't be a meaningful overhead compared to the cost of a
checkpoint, and it seems like it'd be quite useful for debugging. But I can
see uses going beyond that - we occasionally have been discussing associating
additional data with redo points, and that'd be a lot easier to deal with
during recovery with such a record.
I don't really see a performance and concurrency angle right now - what are
you wondering about?
Another thing that I'm not too sure about is: what happens if we find
a relation file on disk that doesn't appear in the backup_manifest for
the previous backup and isn't mentioned in the WAL summaries either?
Wouldn't that commonly happen for unlogged relations at least?
I suspect there's also other ways to end up with such additional files,
e.g. by crashing during the creation of a new relation.
A few less-serious problems with the patch:
- We don't have an incremental JSON parser, so if you have a
backup_manifest>1GB, pg_basebackup --incremental is going to fail.
That's also true of the existing code in pg_verifybackup, and for the
same reason. I talked to Andrew Dunstan at one point about adapting
our JSON parser to support incremental parsing, and he had a patch for
that, but I think he found some problems with it and I'm not sure what
the current status is.
As a stopgap measure, can't we just use the relevant flag to allow larger
allocations?
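(Assuming the flag in question is MCXT_ALLOC_HUGE - just a guess at what
was meant - the stopgap would amount to something like:

    /*
     * Sketch only: read the whole manifest into one buffer that may
     * exceed the ordinary 1GB palloc limit. 'manifest_size' is a
     * hypothetical variable holding the file size.
     */
    char       *buf = palloc_extended(manifest_size + 1, MCXT_ALLOC_HUGE);

so the parser keeps working on a single giant string.)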
- The patch does support differential backup, aka an incremental atop
another incremental. There's no particular limit to how long a chain
of backups can be. However, pg_combinebackup currently requires that
the first backup is a full backup and all the later ones are
incremental backups. So if you have a full backup a and an incremental
backup b and a differential backup c, you can combine a b and c to get
a full backup equivalent to one you would have gotten if you had taken
a full backup at the time you took c. However, you can't combine b and
c with each other without combining them with a, and that might be
desirable in some situations. You might want to collapse a bunch of
older differential backups into a single one that covers the whole
time range of all of them. I think that the file format can support
that, but the tool is currently too dumb.
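To spell that out using the same syntax as the usage example earlier in
the thread (directory names are illustrative):

pg_basebackup -cfast -Da
pg_basebackup -cfast -Db --incremental a/backup_manifest
pg_basebackup -cfast -Dc --incremental b/backup_manifest
pg_combinebackup a b c -o restored     # accepted
pg_combinebackup b c -o collapsed      # currently rejected: no full backup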
That seems like a feature for the future...
- We only know how to operate on directories, not tar files. I thought
about that when working on pg_verifybackup as well, but I didn't do
anything about it. It would be nice to go back and make that tool work
on tar-format backups, and this one, too. I don't think there would be
a whole lot of point trying to operate on compressed tar files because
you need random access and that seems hard on a compressed file, but
on uncompressed files it seems at least theoretically doable. I'm not
sure whether anyone would care that much about this, though, even
though it does sound pretty cool.
I don't know the tar format well, but my understanding is that it doesn't have
a "central metadata" portion. I.e. doing something like this would entail
scanning the tar file sequentially, skipping file contents? And wouldn't you
have to create an entirely new tar file for the modified output? That kind of
makes it not so incremental ;)
IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't
seem worth bothering about now. But I might just be missing something.
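(For what it's worth, the sequential scan itself is simple; a minimal
sketch of walking uncompressed ustar headers while skipping the contents,
ignoring pax/GNU extension headers:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        FILE       *f;
        char        hdr[512];

        if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
            return 1;
        while (fread(hdr, 1, 512, f) == 512 && hdr[0] != '\0')
        {
            /* the size field is 12 bytes of octal text at offset 124 */
            unsigned long size = strtoul(hdr + 124, NULL, 8);

            /* the member name occupies the first 100 bytes */
            printf("%.100s: %lu bytes\n", hdr, size);

            /* contents are padded to a multiple of 512 bytes */
            if (fseek(f, (long) ((size + 511) & ~511UL), SEEK_CUR) != 0)
                break;
        }
        fclose(f);
        return 0;
    }

Random access by name, though, really does require a full pass.)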
Greetings,
Andres Freund
On Wed, Jun 14, 2023 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:
I assume this is "solely" required for keeping the incremental backups as
small as possible, rather than being required for correctness?
I believe so. I want to spend some more time thinking about this to
make sure I'm not missing anything.
Could we just recompute the WAL summary for the [redo, end of chunk] for the
relevant summary file?
I'm not understanding how that would help. If we were going to compute
a WAL summary on the fly rather than waiting for one to show up on
disk, what we'd want is [end of last WAL summary that does exist on
disk, redo]. But I'm not sure that's a great approach, because that
LSN gap might be large and then we're duplicating a lot of work that
the summarizer has probably already done most of.
FWIW, I like the idea of a special WAL record at that point, independent of
this feature. It wouldn't be a meaningful overhead compared to the cost of a
checkpoint, and it seems like it'd be quite useful for debugging. But I can
see uses going beyond that - we occasionally have been discussing associating
additional data with redo points, and that'd be a lot easier to deal with
during recovery with such a record.
I don't really see a performance and concurrency angle right now - what are
you wondering about?
I'm not really sure. I expect Dilip would be happy to post his patch,
and if you'd be willing to have a look at it and express your concerns
or lack thereof, that would be super valuable.
Another thing that I'm not too sure about is: what happens if we find
a relation file on disk that doesn't appear in the backup_manifest for
the previous backup and isn't mentioned in the WAL summaries either?
Wouldn't that commonly happen for unlogged relations at least?
I suspect there's also other ways to end up with such additional files,
e.g. by crashing during the creation of a new relation.
Yeah, this needs some more careful thought.
A few less-serious problems with the patch:
- We don't have an incremental JSON parser, so if you have a
backup_manifest>1GB, pg_basebackup --incremental is going to fail.
That's also true of the existing code in pg_verifybackup, and for the
same reason. I talked to Andrew Dunstan at one point about adapting
our JSON parser to support incremental parsing, and he had a patch for
that, but I think he found some problems with it and I'm not sure what
the current status is.
As a stopgap measure, can't we just use the relevant flag to allow larger
allocations?
I'm not sure that's a good idea, but theoretically, yes. We can also
just choose to accept the limitation that your data directory can't be
too darn big if you want to use this feature. But getting incremental
JSON parsing would be better.
Not having the manifest in JSON would be an even better solution, but
regrettably I did not win that argument.
That seems like a feature for the future...
Sure.
I don't know the tar format well, but my understanding is that it doesn't have
a "central metadata" portion. I.e. doing something like this would entail
scanning the tar file sequentially, skipping file contents? And wouldn't you
have to create an entirely new tar file for the modified output? That kind of
makes it not so incremental ;)
IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't
seem worth bothering about now. But I might just be missing something.
Oh, yeah, it's just an idle thought. I'll get to it when I get to it,
or else I won't.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 14 Jun 2023 at 20:47, Robert Haas <robertmhaas@gmail.com> wrote:
A few years ago, I sketched out a design for incremental backup, but
no patch for incremental backup ever got committed. Instead, the whole
thing evolved into a project to add backup manifests, which are nice,
but not as nice as incremental backup would be. So I've decided to
have another go at incremental backup itself. Attached are some WIP
patches.
Nice, I like this idea.
Let me summarize the design and some open questions and
problems with it that I've discovered. I welcome problem reports and
test results from others, as well.
Skimming through the 7th patch, I see claims that FSM is not fully
WAL-logged and thus shouldn't be tracked, and so it indeed doesn't
track those changes.
I disagree with that decision: we now have support for custom resource
managers, which may use the various forks for other purposes than
those used in PostgreSQL right now. It would be a shame if data is
lost because of the backup tool ignoring forks because the PostgreSQL
project itself doesn't have post-recovery consistency guarantees in
that fork. So, unless we document that WAL-logged changes in the FSM
fork are actually not recoverable from backup, regardless of the type
of contents, we should still keep track of the changes in the FSM fork
and include the fork in our backups or only exclude those FSM updates
that we know are safe to ignore.
Kind regards,
Matthias van de Meent
Neon, Inc.
Hi,
On 2023-06-14 16:10:38 -0400, Robert Haas wrote:
On Wed, Jun 14, 2023 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:
Could we just recompute the WAL summary for the [redo, end of chunk] for the
relevant summary file?
I'm not understanding how that would help. If we were going to compute
a WAL summary on the fly rather than waiting for one to show up on
disk, what we'd want is [end of last WAL summary that does exist on
disk, redo].
Oh, right.
But I'm not sure that's a great approach, because that LSN gap might be
large and then we're duplicating a lot of work that the summarizer has
probably already done most of.
I guess that really depends on what the summary granularity is. If you create
a separate summary every 32MB or so, recomputing just the required range
shouldn't be too bad.
FWIW, I like the idea of a special WAL record at that point, independent of
this feature. It wouldn't be a meaningful overhead compared to the cost of a
checkpoint, and it seems like it'd be quite useful for debugging. But I can
see uses going beyond that - we occasionally have been discussing associating
additional data with redo points, and that'd be a lot easier to deal with
during recovery with such a record.
I don't really see a performance and concurrency angle right now - what are
you wondering about?
I'm not really sure. I expect Dilip would be happy to post his patch,
and if you'd be willing to have a look at it and express your concerns
or lack thereof, that would be super valuable.
Will do. Adding me to CC: might help, I have a backlog unfortunately :(.
Greetings,
Andres Freund
On Thu, Jun 15, 2023 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
I'm not really sure. I expect Dilip would be happy to post his patch,
and if you'd be willing to have a look at it and express your concerns
or lack thereof, that would be super valuable.
Will do. Adding me to CC: might help, I have a backlog unfortunately :(.
Thanks, I have posted it here[1]
[1]: /messages/by-id/CAFiTN-s-K=mVA=HPr_VoU-5bvyLQpNeuzjq1ebPJMEfCJZKFsg@mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 14, 2023 at 4:40 PM Andres Freund <andres@anarazel.de> wrote:
But I'm not sure that's a great approach, because that LSN gap might be
large and then we're duplicating a lot of work that the summarizer has
probably already done most of.I guess that really depends on what the summary granularity is. If you create
a separate summary every 32MB or so, recomputing just the required range
shouldn't be too bad.
Yeah, but I don't think that's the right approach, for two reasons.
First, one of the things I'm rather worried about is what happens when
the WAL distance between the prior backup and the incremental backup
is large. It could be a terabyte. If we have a WAL summary for every
32MB of WAL, that's 32k files we have to read, and I'm concerned
that's too many. Maybe it isn't, but it's something that has really
been weighing on my mind as I've been thinking through the design
questions here. The files are really very small, and having to open a
bazillion tiny little files to get the job done sounds lame. Second, I
don't see what problem it actually solves. Why not just signal the
summarizer to write out the accumulated data to a file instead of
re-doing the work ourselves? Or else adopt the
WAL-record-at-the-redo-pointer approach, and then the whole thing is
moot?
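(To put a number on it: at one summary per 32MB, a 1TB gap between
backups is 1TB / 32MB = 32,768 summary files, every one of which has to
be opened, read, and merged at backup time.)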
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2023-06-19 09:46:12 -0400, Robert Haas wrote:
On Wed, Jun 14, 2023 at 4:40 PM Andres Freund <andres@anarazel.de> wrote:
But I'm not sure that's a great approach, because that LSN gap might be
large and then we're duplicating a lot of work that the summarizer has
probably already done most of.
I guess that really depends on what the summary granularity is. If you create
a separate summary every 32MB or so, recomputing just the required range
shouldn't be too bad.
Yeah, but I don't think that's the right approach, for two reasons.
First, one of the things I'm rather worried about is what happens when
the WAL distance between the prior backup and the incremental backup
is large. It could be a terabyte. If we have a WAL summary for every
32MB of WAL, that's 32k files we have to read, and I'm concerned
that's too many. Maybe it isn't, but it's something that has really
been weighing on my mind as I've been thinking through the design
questions here.
It doesn't have to be a separate file - you could easily summarize ranges
at a higher granularity, storing multiple ranges into a single file with a
coarser naming pattern.
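To illustrate -- this is just a sketch, not part of any posted patch --
such a file could carry a small index of fine-grained sub-summaries, so
that a reader can pull out only the LSN ranges it needs:

/* Hypothetical index entry for one sub-summary within a coarse file. */
typedef struct SummaryIndexEntry
{
    XLogRecPtr  start_lsn;      /* first WAL position covered */
    XLogRecPtr  end_lsn;        /* end of the covered range */
    uint32      offset;         /* byte offset of the sub-summary payload */
    uint32      length;         /* payload length in bytes */
} SummaryIndexEntry;

The file name then only needs to encode the coarse overall range, while
the individual entries can still line up with possible redo points.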
The files are really very small, and having to open a bazillion tiny little
files to get the job done sounds lame. Second, I don't see what problem it
actually solves. Why not just signal the summarizer to write out the
accumulated data to a file instead of re-doing the work ourselves? Or else
adopt the WAL-record-at-the-redo-pointer approach, and then the whole thing
is moot?
The one point I see in favor of a relatively fine-grained summarization
scheme is that it would pave the way for using the WAL summary data for
other purposes in the future. That could be done orthogonally to the
other solutions to the redo pointer issues.
Other potential use cases:
- only restore parts of a base backup that aren't going to be overwritten by
WAL replay
- reconstructing database contents from WAL after data loss
- more efficient pg_rewind
- more efficient prefetching during WAL replay
Greetings,
Andres Freund
Hi,
In the limited time that I've had to work on this project lately, I've
been trying to come up with a test case for this feature -- and since
I've gotten completely stuck, I thought it might be time to post and
see if anyone else has a better idea. I thought a reasonable test case
would be: Do a full backup. Change some stuff. Do an incremental
backup. Restore both backups and perform replay to the same LSN. Then
compare the files on disk. But I cannot make this work. The first
problem I ran into was that replay of the full backup does a
restartpoint, while the replay of the incremental backup does not.
That results in, for example, pg_subtrans having different contents.
I'm not sure whether it can also result in data files having different
contents: are changes that we replayed following the last restartpoint
guaranteed to end up on disk when the server is shut down? It wasn't
clear to me that this is the case. I thought maybe I could get both
servers to perform a restartpoint at the same location by shutting
down the primary and then replaying through the shutdown checkpoint,
but that doesn't work because the primary doesn't finish archiving
before shutting down. After some more fiddling I settled (at least for
research purposes) on having the restored backups PITR and promote,
instead of PITR and pause, so that we're guaranteed a checkpoint. But
that just caused me to run into a far worse problem: replay on the
standby doesn't actually create a state that is byte-for-byte
identical to the one that exists on the primary. I quickly discovered
that in my test case, I was ending up with different contents in the
"hole" of a block wherein a tuple got updated. Replay doesn't think
it's important to make the hole end up with the same contents on all
machines that replay the WAL, so I end up with one server that has
more junk in there than the other one and the tests fail.
Unless someone has a brilliant idea that I lack, this suggests to me
that this whole line of testing is a dead end. I can, of course, write
tests that compare clusters *logically* -- do the correct relations
exist, are they accessible, do they have the right contents? But I
feel like it would be easy to have bugs that escape detection in such
a test but would be detected by a physical comparison of the clusters.
However, such a comparison can only be conducted if either (a) there's
some way to set up the test so that byte-for-byte identical clusters
can be expected or (b) there's some way to perform the comparison that
can distinguish between expected, harmless differences and unexpected,
problematic differences. And at the moment my conclusion is that
neither (a) nor (b) exists. Does anyone think otherwise?
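One idea in the direction of (b), sketched here only -- this assumes
the standard PageHeaderData layout from bufpage.h and is not code from
any posted patch -- would be to mask out the hole between pd_lower and
pd_upper before comparing relation pages:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for the fixed part of PageHeaderData (24 bytes). */
typedef struct
{
    uint64_t    pd_lsn;
    uint16_t    pd_checksum;
    uint16_t    pd_flags;
    uint16_t    pd_lower;       /* offset to start of free space */
    uint16_t    pd_upper;       /* offset to end of free space */
    uint16_t    pd_special;
    uint16_t    pd_pagesize_version;
    uint32_t    pd_prune_xid;
} PageHeaderMirror;

/*
 * Compare two images of the same block, ignoring the hole between
 * pd_lower and pd_upper, which WAL replay does not promise to
 * reproduce byte-for-byte.
 */
static bool
pages_equal_ignoring_hole(const unsigned char *a, const unsigned char *b,
                          size_t blcksz)
{
    const PageHeaderMirror *ha = (const PageHeaderMirror *) a;

    /* If the headers differ at all, the pages genuinely differ. */
    if (memcmp(a, b, sizeof(PageHeaderMirror)) != 0)
        return false;

    /* On a corrupt-looking header, fall back to a full comparison. */
    if (ha->pd_lower > ha->pd_upper || ha->pd_upper > blcksz)
        return memcmp(a, b, blcksz) == 0;

    /* Compare everything before the hole and everything after it. */
    return memcmp(a, b, ha->pd_lower) == 0 &&
        memcmp(a + ha->pd_upper, b + ha->pd_upper,
               blcksz - ha->pd_upper) == 0;
}

Even with that, the non-relation files (pg_subtrans and friends) and
any other differences that replay is allowed to produce would still
need their own handling, so this is at best a partial answer.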
Meanwhile, here's a rebased set of patches. The somewhat-primitive
attempts at writing tests are in 0009, but they don't work, for the
reasons explained above. I think I'd probably like to go ahead and
commit 0001 and 0002 soon if there are no objections, since I think
those are good refactorings independently of the rest of this.
...Robert
Attachments:
0004-Refactor-parse_filename_for_nontemp_relation-to-pars.patch
From 816266cb17cbcea889a13de1e146fcb5ebfe2066 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:15 -0400
Subject: [PATCH 4/9] Refactor parse_filename_for_nontemp_relation to parse
more.
Instead of returning the number of characters in the RelFileNumber,
return the RelFileNumber itself. Continue to return the fork number,
as before, and additionally return the segment number.
parse_filename_for_nontemp_relation now rejects a RelFileNumber or
segment number that begins with a leading zero. Before, we accepted
such cases as relation filenames, but if we continued to do so after
this change, the function might return the same values for two
different files (e.g. 1234.5 and 001234.5 or 1234.005) which could be
annoying for callers. Since we don't actually ever generate filenames
with leading zeroes in the names, any such files that we find must
have been created by something other than PostgreSQL, and it is
therefore reasonable to treat them as non-relation files.
Along the way, change unlogged_relation_entry to store a RelFileNumber
rather than an OID. This update should have been made in
851f4cc75cdd8c831f1baa9a7abf8c8248b65890, but it was overlooked.
It could be done separately from the rest of this commit, but that
would be more involved, whereas this way it's a 1-line change.
---
src/backend/backup/basebackup.c | 15 ++--
src/backend/storage/file/reinit.c | 137 ++++++++++++++++++------------
src/include/storage/reinit.h | 5 +-
3 files changed, 93 insertions(+), 64 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index cc3d2e0c41..24c038dfba 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1198,9 +1198,9 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
{
int excludeIdx;
bool excludeFound;
- ForkNumber relForkNum; /* Type of fork if file is a relation */
- int relnumchars; /* Chars in filename that are the
- * relnumber */
+ RelFileNumber relNumber;
+ ForkNumber relForkNum;
+ unsigned segno;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1250,23 +1250,20 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &relForkNum, &segno))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
{
char initForkFile[MAXPGPATH];
- char relNumber[OIDCHARS + 1];
/*
* If any other type of fork, check if there is an init fork
* with the same RelFileNumber. If so, the file can be
* excluded.
*/
- memcpy(relNumber, de->d_name, relnumchars);
- relNumber[relnumchars] = '\0';
- snprintf(initForkFile, sizeof(initForkFile), "%s/%s_init",
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relNumber);
if (lstat(initForkFile, &statbuf) == 0)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..31d6e01106 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
- Oid reloid; /* hash key */
+ RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
/*
@@ -195,12 +195,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -208,10 +209,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the RelFileNumber into the hash table, if it isn't already.
*/
- ent.reloid = atooid(de->d_name);
(void) hash_search(hash, &ent, HASH_ENTER, NULL);
}
@@ -235,12 +234,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* We never remove the init fork. */
@@ -251,7 +251,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
if (hash_search(hash, &ent, HASH_FIND, NULL))
{
snprintf(rm_path, sizeof(rm_path), "%s/%s",
@@ -285,14 +284,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ RelFileNumber relNumber;
+ unsigned segno;
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -304,11 +303,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspacedirname, de->d_name);
/* Construct destination pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(dstpath, sizeof(dstpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
/* OK, we're ready to perform the actual copy. */
elog(DEBUG2, "copying %s to %s", srcpath, dstpath);
@@ -327,14 +327,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
+ RelFileNumber relNumber;
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ unsigned segno;
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -342,11 +342,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/* Construct main fork pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(mainpath, sizeof(mainpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(mainpath, sizeof(mainpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
fsync_fname(mainpath, false);
}
@@ -371,52 +372,82 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* This function returns true if the file appears to be in the correct format
* for a non-temporary relation and false otherwise.
*
- * NB: If this function returns true, the caller is entitled to assume that
- * *relnumchars has been set to a value no more than OIDCHARS, and thus
- * that a buffer of OIDCHARS+1 characters is sufficient to hold the
- * RelFileNumber portion of the filename. This is critical to protect against
- * a possible buffer overrun.
+ * If it returns true, it sets *relnumber, *fork, and *segno to the values
+ * extracted from the filename. If it returns false, these values are set to
+ * InvalidRelFileNumber, InvalidForkNumber, and 0, respectively.
*/
bool
-parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
+ ForkNumber *fork, unsigned *segno)
{
- int pos;
+ unsigned long n,
+ s;
+ ForkNumber f;
+ char *endp;
- /* Look for a non-empty string of digits (that isn't too long). */
- for (pos = 0; isdigit((unsigned char) name[pos]); ++pos)
- ;
- if (pos == 0 || pos > OIDCHARS)
+ *relnumber = InvalidRelFileNumber;
+ *fork = InvalidForkNumber;
+ *segno = 0;
+
+ /*
+ * Relation filenames should begin with a digit that is not a zero. By
+ * rejecting cases involving leading zeroes, the caller can assume that
+ * there's only one possible string of characters that could have produced
+ * any given value for *relnumber.
+ *
+ * (To be clear, we don't expect files with names like 0017.3 to exist
+ * at all -- but if 0017.3 does exist, it's a non-relation file, not
+ * part of the main fork for relfilenode 17.)
+ */
+ if (name[0] < '1' || name[0] > '9')
+ return false;
+
+ /*
+ * Parse the leading digit string. If the value is out of range, we
+ * conclude that this isn't a relation file at all.
+ */
+ errno = 0;
+ n = strtoul(name, &endp, 10);
+ if (errno || name == endp || n <= 0 || n > PG_UINT32_MAX)
return false;
- *relnumchars = pos;
+ name = endp;
/* Check for a fork name. */
- if (name[pos] != '_')
- *fork = MAIN_FORKNUM;
+ if (*name != '_')
+ f = MAIN_FORKNUM;
else
{
int forkchar;
- forkchar = forkname_chars(&name[pos + 1], fork);
+ forkchar = forkname_chars(name + 1, &f);
if (forkchar <= 0)
return false;
- pos += forkchar + 1;
+ name += forkchar + 1;
}
/* Check for a segment number. */
- if (name[pos] == '.')
+ if (*name != '.')
+ s = 0;
+ else
{
- int segchar;
+ /* Reject leading zeroes, just like we do for RelFileNumber. */
+ if (name[1] < '1' || name[1] > '9')
+ return false;
- for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
- ;
- if (segchar <= 1)
+ errno = 0;
+ s = strtoul(name + 1, &endp, 10);
+ if (errno || name + 1 == endp || s <= 0 || s > PG_UINT32_MAX)
return false;
- pos += segchar;
+ name = endp;
}
/* Now we should be at the end. */
- if (name[pos] != '\0')
+ if (*name != '\0')
return false;
+
+ /* Set out parameters and return. */
+ *relnumber = (RelFileNumber) n;
+ *fork = f;
+ *segno = (unsigned) s;
return true;
}
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..f8eb7ce234 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,8 +20,9 @@
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
- int *relnumchars,
- ForkNumber *fork);
+ RelFileNumber *relnumber,
+ ForkNumber *fork,
+ unsigned *segno);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
--
2.37.1 (Apple Git-137.1)
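For reference, a hypothetical caller of the revised function might look
like this (the file name is made up; only the signature comes from the
patch above):

/* Somewhere with postgres.h and storage/reinit.h included. */
RelFileNumber relnumber;
ForkNumber  fork;
unsigned    segno;

/* "16384_fsm.2" parses as relfilenumber 16384, FSM fork, segment 2. */
if (parse_filename_for_nontemp_relation("16384_fsm.2", &relnumber,
                                        &fork, &segno))
    elog(DEBUG2, "relfilenumber %u, fork %d, segment %u",
         relnumber, fork, segno);
else
    elog(DEBUG2, "not a relation file");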
0002-In-basebackup.c-refactor-to-create-read_file_data_in.patch
From f3aebf944f13080b108cbc1b0247e22dc0b8a187 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:40:14 -0400
Subject: [PATCH 2/9] In basebackup.c, refactor to create
read_file_data_into_buffer.
This further reduces the length and complexity of sendFile(),
hopefully making it easier to understand and modify. In addition
to moving some logic into a new function, I took this opportunity
to make a few slight adjustments to sendFile() itself, including
renaming the 'len' variable to 'bytes_done', since we use it to represent
the number of bytes we've already handled so far, not the total
length of the file.
---
src/backend/backup/basebackup.c | 231 ++++++++++++++++++--------------
1 file changed, 133 insertions(+), 98 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 0daf8257bc..f46f930329 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -83,6 +83,12 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
+static off_t read_file_data_into_buffer(bbsink *sink,
+ const char *readfilename, int fd,
+ off_t offset, size_t length,
+ BlockNumber blkno,
+ bool verify_checksum,
+ int *checksum_failures);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -1490,9 +1496,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
BlockNumber blkno = 0;
int checksum_failures = 0;
off_t cnt;
- int i;
- pgoff_t len = 0;
- char *page;
+ pgoff_t bytes_done = 0;
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
@@ -1514,6 +1518,12 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
_tarWriteHeader(sink, tarfilename, NULL, statbuf, false);
+ /*
+ * Checksums are verified in multiples of BLCKSZ, so the buffer length
+ * should be a multiple of the block size as well.
+ */
+ Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
+
if (!noverify_checksums && DataChecksumsEnabled())
{
char *filename;
@@ -1551,23 +1561,21 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (len < statbuf->st_size)
+ while (bytes_done < statbuf->st_size)
{
- size_t remaining = statbuf->st_size - len;
+ size_t remaining = statbuf->st_size - bytes_done;
/* Try to read some more data. */
- cnt = basebackup_read_file(fd, sink->bbs_buffer,
- Min(sink->bbs_buffer_length, remaining),
- len, readfilename, true);
+ cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
+ remaining,
+ blkno + segmentno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
/*
- * The checksums are verified at block level, so we iterate over the
- * buffer in chunks of BLCKSZ, after making sure that
- * TAR_SEND_SIZE/buf is divisible by BLCKSZ and we read a multiple of
- * BLCKSZ bytes.
+ * If the amount of data we were able to read was not a multiple of
+ * BLCKSZ, we cannot verify checksums, which are block-level.
*/
- Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
-
if (verify_checksum && (cnt % BLCKSZ != 0))
{
ereport(WARNING,
@@ -1578,84 +1586,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
verify_checksum = false;
}
- if (verify_checksum)
- {
- for (i = 0; i < cnt / BLCKSZ; i++)
- {
- int reread_cnt;
- uint16 expected_checksum;
-
- page = sink->bbs_buffer + BLCKSZ * i;
-
- /* If the page is OK, go on to the next one. */
- if (verify_page_checksum(page, sink->bbs_state->startptr,
- blkno + i + segmentno * RELSEG_SIZE,
- &expected_checksum))
- continue;
-
- /*
- * Retry the block on the first failure. It's possible that
- * we read the first 4K page of the block just before postgres
- * updated the entire block so it ends up looking torn to us.
- * If, before we retry the read, the concurrent write of the
- * block finishes, the page LSN will be updated and we'll
- * realize that we should ignore this block.
- *
- * There's no guarantee that this will actually happen,
- * though: the torn write could take an arbitrarily long time
- * to complete. Retrying multiple times wouldn't fix this
- * problem, either, though it would reduce the chances of it
- * happening in practice. The only real fix here seems to be
- * to have some kind of interlock that allows us to wait until
- * we can be certain that no write to the block is in
- * progress. Since we don't have any such thing right now, we
- * just do this and hope for the best.
- */
- reread_cnt =
- basebackup_read_file(fd,
- sink->bbs_buffer + BLCKSZ * i,
- BLCKSZ, len + BLCKSZ * i,
- readfilename,
- false);
- if (reread_cnt == 0)
- {
- /*
- * If we hit end-of-file, a concurrent truncation must
- * have occurred, so break out of this loop just as if the
- * initial fread() returned 0. We'll drop through to the
- * same code that handles that case. (We must fix up cnt
- * first, though.)
- */
- cnt = BLCKSZ * i;
- break;
- }
-
- /* If the page now looks OK, go on to the next one. */
- if (verify_page_checksum(page, sink->bbs_state->startptr,
- blkno + i + segmentno * RELSEG_SIZE,
- &expected_checksum))
- continue;
-
- /* Handle checksum failure. */
- checksum_failures++;
- if (checksum_failures <= 5)
- ereport(WARNING,
- (errmsg("checksum verification failed in "
- "file \"%s\", block %u: calculated "
- "%X but expected %X",
- readfilename, blkno + i, expected_checksum,
- ((PageHeader) page)->pd_checksum)));
- if (checksum_failures == 5)
- ereport(WARNING,
- (errmsg("further checksum verification "
- "failures in file \"%s\" will not "
- "be reported", readfilename)));
- }
-
- /* Update block number for next pass through the outer loop. */
- blkno += i;
- }
-
/*
* If we hit end-of-file, a concurrent truncation must have occurred.
* That's not an error condition, because WAL replay will fix things
@@ -1664,6 +1594,10 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
if (cnt == 0)
break;
+ /* Update block number and # of bytes done for next loop iteration. */
+ blkno += cnt / BLCKSZ;
+ bytes_done += cnt;
+
/* Archive the data we just read. */
bbsink_archive_contents(sink, cnt);
@@ -1671,14 +1605,12 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
if (pg_checksum_update(&checksum_ctx,
(uint8 *) sink->bbs_buffer, cnt) < 0)
elog(ERROR, "could not update checksum of base backup");
-
- len += cnt;
}
/* If the file was truncated while we were sending it, pad it with zeros */
- while (len < statbuf->st_size)
+ while (bytes_done < statbuf->st_size)
{
- size_t remaining = statbuf->st_size - len;
+ size_t remaining = statbuf->st_size - bytes_done;
size_t nbytes = Min(sink->bbs_buffer_length, remaining);
MemSet(sink->bbs_buffer, 0, nbytes);
@@ -1687,7 +1619,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
nbytes) < 0)
elog(ERROR, "could not update checksum of base backup");
bbsink_archive_contents(sink, nbytes);
- len += nbytes;
+ bytes_done += nbytes;
}
/*
@@ -1695,7 +1627,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
* of data is probably not worth throttling, and is not checksummed
* because it's not actually part of the file.)
*/
- _tarWritePadding(sink, len);
+ _tarWritePadding(sink, bytes_done);
CloseTransientFile(fd);
@@ -1718,6 +1650,109 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
return true;
}
+/*
+ * Read some more data from the file into the bbsink's buffer, verifying
+ * checksums as required.
+ *
+ * 'offset' is the file offset from which we should begin to read, and
+ * 'length' is the amount of data that should be read. The actual amount
+ * of data read will be less than the requested amount if the bbsink's
+ * buffer isn't big enough to hold it all, or if the underlying file has
+ * been truncated. The return value is the number of bytes actually read.
+ *
+ * 'blkno' is the block number of the first page in the bbsink's buffer
+ * relative to the start of the relation.
+ *
+ * 'verify_checksum' indicates whether we should try to verify checksums
+ * for the blocks we read. If we do this, we'll update *checksum_failures
+ * and issue warnings as appropriate.
+ */
+static off_t
+read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
+ off_t offset, size_t length, BlockNumber blkno,
+ bool verify_checksum, int *checksum_failures)
+{
+ off_t cnt;
+ int i;
+ char *page;
+
+ /* Try to read some more data. */
+ cnt = basebackup_read_file(fd, sink->bbs_buffer,
+ Min(sink->bbs_buffer_length, length),
+ offset, readfilename, true);
+
+ /* Can't verify checksums if read length is not a multiple of BLCKSZ. */
+ if (!verify_checksum || (cnt % BLCKSZ) != 0)
+ return cnt;
+
+ /* Verify checksum for each block. */
+ for (i = 0; i < cnt / BLCKSZ; i++)
+ {
+ int reread_cnt;
+ uint16 expected_checksum;
+
+ page = sink->bbs_buffer + BLCKSZ * i;
+
+ /* If the page is OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr, blkno + i,
+ &expected_checksum))
+ continue;
+
+ /*
+ * Retry the block on the first failure. It's possible that we read
+ * the first 4K page of the block just before postgres updated the
+ * entire block so it ends up looking torn to us. If, before we retry
+ * the read, the concurrent write of the block finishes, the page LSN
+ * will be updated and we'll realize that we should ignore this block.
+ *
+ * There's no guarantee that this will actually happen, though: the
+ * torn write could take an arbitrarily long time to complete.
+ * Retrying multiple times wouldn't fix this problem, either, though
+ * it would reduce the chances of it happening in practice. The only
+ * real fix here seems to be to have some kind of interlock that
+ * allows us to wait until we can be certain that no write to the
+ * block is in progress. Since we don't have any such thing right now,
+ * we just do this and hope for the best.
+ */
+ reread_cnt =
+ basebackup_read_file(fd, sink->bbs_buffer + BLCKSZ * i,
+ BLCKSZ, offset + BLCKSZ * i,
+ readfilename, false);
+ if (reread_cnt == 0)
+ {
+ /*
+ * If we hit end-of-file, a concurrent truncation must have
+ * occurred, so reduce cnt to reflect only the blocks already
+ * processed and break out of this loop.
+ */
+ cnt = BLCKSZ * i;
+ break;
+ }
+
+ /* If the page now looks OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr, blkno + i,
+ &expected_checksum))
+ continue;
+
+ /* Handle checksum failure. */
+ (*checksum_failures)++;
+ if (*checksum_failures <= 5)
+ ereport(WARNING,
+ (errmsg("checksum verification failed in "
+ "file \"%s\", block %u: calculated "
+ "%X but expected %X",
+ readfilename, blkno + i, expected_checksum,
+ ((PageHeader) page)->pd_checksum)));
+ if (*checksum_failures == 5)
+ ereport(WARNING,
+ (errmsg("further checksum verification "
+ "failures in file \"%s\" will not "
+ "be reported", readfilename)));
+ }
+
+ return cnt;
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
--
2.37.1 (Apple Git-137.1)
0001-In-basebackup.c-refactor-to-create-verify_page_check.patch
From b65fb32ca1474a1158d5490970b9eb147fbe4f47 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:40:07 -0400
Subject: [PATCH 1/9] In basebackup.c, refactor to create verify_page_checksum.
If checksum verification fails for a particular page, we reread the
page and try one more time. The code that does this is somewhat complex
and difficult to follow. Move some of the logic into a new function
and rearrange the code a bit to try to make it clearer. This way,
we don't need the block_retry Boolean, a couple of other variables
move from sendFile() into the new function, and some code is now less
deeply indented.
---
src/backend/backup/basebackup.c | 188 ++++++++++++++++++--------------
1 file changed, 104 insertions(+), 84 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..0daf8257bc 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -83,6 +83,9 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
+static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
+ BlockNumber blkno,
+ uint16 *expected_checksum);
static void sendFileWithContent(bbsink *sink, const char *filename,
const char *content,
backup_manifest_info *manifest);
@@ -1485,14 +1488,11 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
{
int fd;
BlockNumber blkno = 0;
- bool block_retry = false;
- uint16 checksum;
int checksum_failures = 0;
off_t cnt;
int i;
pgoff_t len = 0;
char *page;
- PageHeader phdr;
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
@@ -1582,94 +1582,78 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
{
for (i = 0; i < cnt / BLCKSZ; i++)
{
+ int reread_cnt;
+ uint16 expected_checksum;
+
page = sink->bbs_buffer + BLCKSZ * i;
+ /* If the page is OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr,
+ blkno + i + segmentno * RELSEG_SIZE,
+ &expected_checksum))
+ continue;
+
/*
- * Only check pages which have not been modified since the
- * start of the base backup. Otherwise, they might have been
- * written only halfway and the checksum would not be valid.
- * However, replaying WAL would reinstate the correct page in
- * this case. We also skip completely new pages, since they
- * don't have a checksum yet.
+ * Retry the block on the first failure. It's possible that
+ * we read the first 4K page of the block just before postgres
+ * updated the entire block so it ends up looking torn to us.
+ * If, before we retry the read, the concurrent write of the
+ * block finishes, the page LSN will be updated and we'll
+ * realize that we should ignore this block.
+ *
+ * There's no guarantee that this will actually happen,
+ * though: the torn write could take an arbitrarily long time
+ * to complete. Retrying multiple times wouldn't fix this
+ * problem, either, though it would reduce the chances of it
+ * happening in practice. The only real fix here seems to be
+ * to have some kind of interlock that allows us to wait until
+ * we can be certain that no write to the block is in
+ * progress. Since we don't have any such thing right now, we
+ * just do this and hope for the best.
*/
- if (!PageIsNew(page) && PageGetLSN(page) < sink->bbs_state->startptr)
+ reread_cnt =
+ basebackup_read_file(fd,
+ sink->bbs_buffer + BLCKSZ * i,
+ BLCKSZ, len + BLCKSZ * i,
+ readfilename,
+ false);
+ if (reread_cnt == 0)
{
- checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
- phdr = (PageHeader) page;
- if (phdr->pd_checksum != checksum)
- {
- /*
- * Retry the block on the first failure. It's
- * possible that we read the first 4K page of the
- * block just before postgres updated the entire block
- * so it ends up looking torn to us. If, before we
- * retry the read, the concurrent write of the block
- * finishes, the page LSN will be updated and we'll
- * realize that we should ignore this block.
- *
- * There's no guarantee that this will actually
- * happen, though: the torn write could take an
- * arbitrarily long time to complete. Retrying
- * multiple times wouldn't fix this problem, either,
- * though it would reduce the chances of it happening
- * in practice. The only real fix here seems to be to
- * have some kind of interlock that allows us to wait
- * until we can be certain that no write to the block
- * is in progress. Since we don't have any such thing
- * right now, we just do this and hope for the best.
- */
- if (block_retry == false)
- {
- int reread_cnt;
-
- /* Reread the failed block */
- reread_cnt =
- basebackup_read_file(fd,
- sink->bbs_buffer + BLCKSZ * i,
- BLCKSZ, len + BLCKSZ * i,
- readfilename,
- false);
- if (reread_cnt == 0)
- {
- /*
- * If we hit end-of-file, a concurrent
- * truncation must have occurred, so break out
- * of this loop just as if the initial fread()
- * returned 0. We'll drop through to the same
- * code that handles that case. (We must fix
- * up cnt first, though.)
- */
- cnt = BLCKSZ * i;
- break;
- }
-
- /* Set flag so we know a retry was attempted */
- block_retry = true;
-
- /* Reset loop to validate the block again */
- i--;
- continue;
- }
-
- checksum_failures++;
-
- if (checksum_failures <= 5)
- ereport(WARNING,
- (errmsg("checksum verification failed in "
- "file \"%s\", block %u: calculated "
- "%X but expected %X",
- readfilename, blkno, checksum,
- phdr->pd_checksum)));
- if (checksum_failures == 5)
- ereport(WARNING,
- (errmsg("further checksum verification "
- "failures in file \"%s\" will not "
- "be reported", readfilename)));
- }
+ /*
+ * If we hit end-of-file, a concurrent truncation must
+ * have occurred, so break out of this loop just as if the
+ * initial fread() returned 0. We'll drop through to the
+ * same code that handles that case. (We must fix up cnt
+ * first, though.)
+ */
+ cnt = BLCKSZ * i;
+ break;
}
- block_retry = false;
- blkno++;
+
+ /* If the page now looks OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr,
+ blkno + i + segmentno * RELSEG_SIZE,
+ &expected_checksum))
+ continue;
+
+ /* Handle checksum failure. */
+ checksum_failures++;
+ if (checksum_failures <= 5)
+ ereport(WARNING,
+ (errmsg("checksum verification failed in "
+ "file \"%s\", block %u: calculated "
+ "%X but expected %X",
+ readfilename, blkno + i, expected_checksum,
+ ((PageHeader) page)->pd_checksum)));
+ if (checksum_failures == 5)
+ ereport(WARNING,
+ (errmsg("further checksum verification "
+ "failures in file \"%s\" will not "
+ "be reported", readfilename)));
}
+
+ /* Update block number for next pass through the outer loop. */
+ blkno += i;
}
/*
@@ -1734,6 +1718,42 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
return true;
}
+/*
+ * Try to verify the checksum for the provided page, if it seems appropriate
+ * to do so.
+ *
+ * Returns true if verification succeeds or if we decide not to check it,
+ * and false if verification fails. When returning false, it also sets
+ * *expected_checksum to the computed value.
+ */
+static bool
+verify_page_checksum(Page page, XLogRecPtr start_lsn, BlockNumber blkno,
+ uint16 *expected_checksum)
+{
+ PageHeader phdr;
+ uint16 checksum;
+
+ /*
+ * Only check pages which have not been modified since the start of the
+ * base backup. Otherwise, they might have been written only halfway and
+ * the checksum would not be valid. However, replaying WAL would
+ * reinstate the correct page in this case. We also skip completely new
+ * pages, since they don't have a checksum yet.
+ */
+ if (PageIsNew(page) || PageGetLSN(page) >= start_lsn)
+ return true;
+
+ /* Perform the actual checksum calculation. */
+ checksum = pg_checksum_page(page, blkno);
+
+ /* See whether it matches the value from the page. */
+ phdr = (PageHeader) page;
+ if (phdr->pd_checksum == checksum)
+ return true;
+ *expected_checksum = checksum;
+ return false;
+}
+
static int64
_tarWriteHeader(bbsink *sink, const char *filename, const char *linktarget,
struct stat *statbuf, bool sizeonly)
--
2.37.1 (Apple Git-137.1)
0005-Change-how-a-base-backup-decides-which-files-have-ch.patch
From 23397800976a006b4280617003eec4413898f955 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:19 -0400
Subject: [PATCH 5/9] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 173 +++++++++++---------------------
1 file changed, 56 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 24c038dfba..64ab54fe06 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,41 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork
+ * with the same RelFileNumber. If so, the file can be
+ * excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1413,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1436,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1450,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1458,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1483,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1505,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
0003-Change-struct-tablespaceinfo-s-oid-member-from-char-.patch
From de555bebd21507637a738c20d2933e05662404c4 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:00 -0400
Subject: [PATCH 3/9] Change struct tablespaceinfo's oid member from 'char *'
to 'Oid'
This shouldn't change behavior except in the unusual case where
there are files in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, or the value is actually zero, or because
the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
---
src/backend/access/transam/xlog.c | 19 ++++++++++--
src/backend/access/transam/xlogrecovery.c | 12 ++++++--
src/backend/backup/backup_manifest.c | 6 ++--
src/backend/backup/basebackup.c | 35 ++++++++++++-----------
src/backend/backup/basebackup_copy.c | 2 +-
src/include/backup/backup_manifest.h | 2 +-
src/include/backup/basebackup.h | 2 +-
7 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..6f38d0eb9a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8484,9 +8484,22 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
char *relpath = NULL;
char *s;
PGFileType de_type;
+ char *badp;
+ Oid tsoid;
- /* Skip anything that doesn't look like a tablespace */
- if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+ /*
+ * Try to parse the directory name as an unsigned integer.
+ *
+ * Tablespace directories should be positive integers that can
+ * be represented in 32 bits, with no leading zeroes or trailing
+ * garbage. If we come across a name that doesn't meet those
+ * criteria, skip it.
+ */
+ if (de->d_name[0] < '1' || de->d_name[0] > '9')
+ continue;
+ errno = 0;
+ tsoid = strtoul(de->d_name, &badp, 10);
+ if (*badp != '\0' || errno == EINVAL || errno == ERANGE)
continue;
snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
@@ -8561,7 +8574,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
}
ti = palloc(sizeof(tablespaceinfo));
- ti->oid = pstrdup(de->d_name);
+ ti->oid = tsoid;
ti->path = pstrdup(linkpath);
ti->rpath = relpath;
ti->size = -1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..4ff4430006 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -678,7 +678,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
tablespaceinfo *ti = lfirst(lc);
char *linkloc;
- linkloc = psprintf("pg_tblspc/%s", ti->oid);
+ linkloc = psprintf("pg_tblspc/%u", ti->oid);
/*
* Remove the existing symlink if any and Create the symlink
@@ -692,7 +692,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not create symbolic link \"%s\": %m",
linkloc)));
- pfree(ti->oid);
pfree(ti->path);
pfree(ti);
}
@@ -1341,6 +1340,8 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
+ char *endp;
+
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1360,7 +1361,12 @@ read_tablespace_map(List **tablespaces)
str[n++] = '\0';
ti = palloc0(sizeof(tablespaceinfo));
- ti->oid = pstrdup(str);
+ errno = 0;
+ ti->oid = strtoul(str, &endp, 10);
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
ti->path = pstrdup(str + n);
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index cee6216524..aeed362a9a 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -97,7 +97,7 @@ FreeBackupManifest(backup_manifest_info *manifest)
* Add an entry to the backup manifest for a file.
*/
void
-AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
+AddFileToBackupManifest(backup_manifest_info *manifest, Oid spcoid,
const char *pathname, size_t size, pg_time_t mtime,
pg_checksum_context *checksum_ctx)
{
@@ -114,9 +114,9 @@ AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
* pathname relative to the data directory (ignoring the intermediate
* symlink traversal).
*/
- if (spcoid != NULL)
+ if (OidIsValid(spcoid))
{
- snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%s/%s", spcoid,
+ snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%u/%s", spcoid,
pathname);
pathname = pathbuf;
}
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index f46f930329..cc3d2e0c41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -75,14 +75,15 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
-static int64 sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
struct backup_manifest_info *manifest);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, const char *spcoid);
+ backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid);
+ struct stat *statbuf, bool missing_ok,
+ Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
@@ -305,7 +306,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, NULL);
+ true, NULL, InvalidOid);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
NULL);
@@ -346,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, NULL);
+ sendtblspclinks, &manifest, InvalidOid);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -355,11 +356,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ false, InvalidOid, InvalidOid, &manifest);
}
else
{
- char *archive_name = psprintf("%s.tar", ti->oid);
+ char *archive_name = psprintf("%u.tar", ti->oid);
bbsink_begin_archive(sink, archive_name);
@@ -623,8 +624,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(sink, pathbuf, pathbuf, &statbuf, false, InvalidOid,
- &manifest, NULL);
+ sendFile(sink, pathbuf, pathbuf, &statbuf, false,
+ InvalidOid, InvalidOid, &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1087,7 +1088,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
_tarWritePadding(sink, len);
- AddFileToBackupManifest(manifest, NULL, filename, len,
+ AddFileToBackupManifest(manifest, InvalidOid, filename, len,
(pg_time_t) statbuf.st_mtime, &checksum_ctx);
}
@@ -1099,7 +1100,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
* Only used to send auxiliary tablespaces, not PGDATA.
*/
static int64
-sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
backup_manifest_info *manifest)
{
int64 size;
@@ -1154,7 +1155,7 @@ sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- const char *spcoid)
+ Oid spcoid)
{
DIR *dir;
struct dirent *de;
@@ -1419,8 +1420,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid,
- manifest, spcoid);
+ true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
+ manifest);
if (sent || sizeonly)
{
@@ -1489,8 +1490,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
*/
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid)
+ struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest)
{
int fd;
BlockNumber blkno = 0;
diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index fee30c21e1..3bdbe1f989 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -407,7 +407,7 @@ SendTablespaceList(List *tablespaces)
}
else
{
- values[0] = ObjectIdGetDatum(strtoul(ti->oid, NULL, 10));
+ values[0] = ObjectIdGetDatum(ti->oid);
values[1] = CStringGetTextDatum(ti->path);
}
if (ti->size >= 0)
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index d41b439980..5a481dbcf5 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -39,7 +39,7 @@ extern void InitializeBackupManifest(backup_manifest_info *manifest,
backup_manifest_option want_manifest,
pg_checksum_type manifest_checksum_type);
extern void AddFileToBackupManifest(backup_manifest_info *manifest,
- const char *spcoid,
+ Oid spcoid,
const char *pathname, size_t size,
pg_time_t mtime,
pg_checksum_context *checksum_ctx);
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 3e68abc2bb..1432d9c206 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -27,7 +27,7 @@
*/
typedef struct
{
- char *oid; /* tablespace's OID, as a decimal string */
+ Oid oid; /* tablespace's OID */
char *path; /* full path to tablespace's directory */
char *rpath; /* relative path if it's within PGDATA, else
* NULL */
--
2.37.1 (Apple Git-137.1)
Attachment: 0006-Move-src-bin-pg_verifybackup-parse_manifest.c-into-s.patch (application/octet-stream)
From d592bff5496ba8e11c2846532290a7b72e2e3b80 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:28 -0400
Subject: [PATCH 6/9] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..e4cd26762b 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -65,6 +65,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..a9ff7f9db8 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -17,6 +17,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 2379f7be7b..672e8bcf25 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
Attachment: 0008-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From d25598e3129566556eda0a161ef0763d115b6f25 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:39 -0400
Subject: [PATCH 8/9] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs documentation and tests.
---
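(Not part of the commit message: for illustration, here is a hypothetical
invocation, with invented OIDs, block numbers, and summary file name; the
output lines follow the printf formats in pg_walsummary.c below.

pg_walsummary $PGDATA/pg_wal/summaries/0000000100000000028A24E800000000028B5C10.summary
TS 1663, DB 5, REL 16385, FORK main: limit 0
TS 1663, DB 5, REL 16386, FORK main: blocks 0..16
TS 1663, DB 5, REL 16386, FORK vm: block 0

With -i, a range like "blocks 0..16" is instead listed one block per line,
and -q parses the files without printing anything.)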
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
6 files changed, 347 insertions(+)
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ /* Zero the option flags so they default to false if no switches are given. */
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
--
2.37.1 (Apple Git-137.1)
Attachment: 0009-Add-TAP-tests-this-is-broken-doesn-t-work.patch (application/octet-stream)
From c0cce1fc702d273d667a4356e2059837fc152b44 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Aug 2023 12:56:01 -0400
Subject: [PATCH 9/9] Add TAP tests (this is broken, doesn't work).
---
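(Once the series is applied, the tests should be runnable in the usual way,
for example:

make -C src/bin/pg_combinebackup check

or via the corresponding meson test target. As the subject line says, they
do not pass yet.)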
src/bin/pg_combinebackup/Makefile | 6 +
src/bin/pg_combinebackup/meson.build | 8 +-
src/bin/pg_combinebackup/t/001_basic.pl | 23 ++
.../pg_combinebackup/t/002_compare_backups.pl | 276 ++++++++++++++++++
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
5 files changed, 332 insertions(+), 2 deletions(-)
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
index cb20480aae..78ba05e624 100644
--- a/src/bin/pg_combinebackup/Makefile
+++ b/src/bin/pg_combinebackup/Makefile
@@ -44,3 +44,9 @@ uninstall:
clean distclean maintainer-clean:
rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
index bea0db405e..a6036dea74 100644
--- a/src/bin/pg_combinebackup/meson.build
+++ b/src/bin/pg_combinebackup/meson.build
@@ -25,5 +25,11 @@ bin_targets += pg_combinebackup
tests += {
'name': 'pg_combinebackup',
'sd': meson.current_source_dir(),
- 'bd': meson.current_build_dir()
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ ],
+ }
}
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..c27f999a32
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,276 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Wait until we exit recovery, then stop the server.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting for apply to reach LSN $lsn";
+$pitr1->stop;
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until we exit recovery, then stop the server.
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting for apply to reach LSN $lsn";
+$pitr2->stop;
+
+my $cmp = compare_data_directories($pitr1->basedir . '/pgdata',
+ $pitr2->basedir . '/pgdata', '');
+is($cmp, 0, "directories are identical");
+
+done_testing();
+
+sub compare_data_directories
+{
+ my ($basedir1, $basedir2, $relpath) = @_;
+ my $result = 0;
+
+ if ($relpath eq '/pg_wal')
+ {
+ # Since recovery started at different LSNs, pg_wal contents may not
+ # be identical. Ignore that.
+ return 0;
+ }
+
+ my $dir1 = $basedir1 . $relpath;
+ my $dir2 = $basedir2 . $relpath;
+
+ opendir(DIR1, $dir1) || die "$dir1: $!";
+ my @files1 = grep { $_ ne '.' && $_ ne '..' } readdir(DIR1);
+ closedir(DIR1);
+
+ opendir(DIR2, $dir2) || die "$dir2: $!";
+ my %files2 = map { $_ => 'unmatched' }
+ grep { $_ ne '.' && $_ ne '..' } readdir(DIR2);
+ closedir(DIR2);
+
+ for my $fname (@files1)
+ {
+ if (!exists $files2{$fname})
+ {
+ warn "$dir1/$fname exists but $dir2/$fname does not";
+ ++$result;
+ next;
+ }
+
+ $files2{$fname} = 'matched';
+
+ if (-d "$dir1/$fname")
+ {
+ if (! -d "$dir2/$fname")
+ {
+ warn "$dir1/$fname is a directory but $dir2/$fname is not";
+ ++$result;
+ }
+ else
+ {
+ $result +=
+ compare_data_directories($basedir1, $basedir2,
+ "$relpath/$fname");
+ }
+ }
+ elsif (-d "$dir2/$fname")
+ {
+ warn "$dir2/$fname is a directory but $dir1/$fname is not";
+ ++$result;
+ }
+ else
+ {
+ # Both are plain files.
+ $result += compare_files($basedir1, $basedir2, "$relpath/$fname");
+ }
+ }
+
+ for my $fname (keys %files2)
+ {
+ if ($files2{$fname} eq 'unmatched')
+ {
+ warn "$dir2/$fname exists but $dir1/$fname does not";
+ ++$result;
+ }
+ }
+
+ return $result;
+}
+
+sub compare_files
+{
+ my ($basedir1, $basedir2, $relpath) = @_;
+ my $file1 = $basedir1 . $relpath;
+ my $file2 = $basedir2 . $relpath;
+
+ if ($relpath eq '/backup_manifest')
+ {
+ # We don't expect the backup manifest to be identical between two
+ # backups taken at different times, so just disregard it.
+ return 0;
+ }
+
+ if ($relpath eq '/backup_label.old')
+ {
+ # We don't expect the backup label to be identical; the start WAL
+ # location and probably also the start time are expected to be
+ # different.
+ return 0;
+ }
+
+ if ($relpath eq '/postgresql.conf')
+ {
+ # At least the port numbers are expected to be different, so
+ # disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/postmaster.opts')
+ {
+ # At least the cluster names are expected to be different, so
+ # disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/global/pg_control')
+ {
+ # At least the mock authentication nonce is expected to be different,
+ # so disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/pg_stat/pgstat.stat')
+ {
+ # Stats aren't stable enough to be compared here.
+ return 0;
+ }
+
+ if ($relpath =~ m@/pg_internal\.init$@)
+ {
+ # relcache init files are rebuilt at startup, so they don't need to
+ # match. And because we write out the contents of data structures like
+ # RelationData that include pointers, they almost certainly won't.
+ return 0;
+ }
+
+ # Check whether the lengths match.
+ my $length1 = -s $file1;
+ my $length2 = -s $file2;
+ if ($length1 != $length2)
+ {
+ warn "$file1 has length $length1, but $file2 has length $length2";
+ return 1;
+ }
+
+ # Compare contents.
+ my $contents1 = slurp_file($file1);
+ my $contents2 = slurp_file($file2);
+ if ($contents1 ne $contents2)
+ {
+ my $nchars = 1;
+ while (substr($contents1, 0, $nchars) eq substr($contents2, 0, $nchars))
+ {
+ ++$nchars;
+ }
+ warn sprintf("%s and %s are both of length %s, but differ beginning at byte %d",
+ $file1, $file2, $length1, $nchars - 1);
+ return 1;
+ }
+
+ # Files are identical.
+ return 0;
+}
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 2a478ba6ed..3b57379e13 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
--
2.37.1 (Apple Git-137.1)
Attachment: 0007-Prototype-patch-for-incremental-and-differential-bac.patch (application/octet-stream)
From d76b71f72567cfb32340f870cf768e31ea3461c6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH 7/9] Prototype patch for incremental and differential backup.
We don't differentiate between incremental and differential backups;
the term "incremental" as used herein means "either incremental or
differential".
This adds a new background process, the WAL summarizer, whose behavior
is governed by the new GUCs wal_summarize_mb and wal_summarize_keep_time.
This writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block" which is
0 if a relation is created or destroyed within a certain range of WAL
records, or otherwise the shortest length to which the relation was
truncated during that range of WAL records, or otherwise
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied in case of an incremental backup covering that
range of WAL records.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice if we could do something about incremental
JSON parsing.
XXX. This needs a lot of work on documentation and tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions.
---
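(To make the incremental file format concrete: judging from the sendFile()
changes below, an INCREMENTAL.${ORIGINAL_NAME} file begins with a header

uint32 magic number (INCREMENTAL_MAGIC)
uint32 number of blocks included in the file
uint32 truncation block length
uint32 block number, repeated once per included block

followed by the BLCKSZ-byte contents of each included block, in the same
order as the block numbers. The backup_label of an incremental backup also
gains two lines, "INCREMENTAL FROM LSN" and "INCREMENTAL FROM TLI", which
is how pg_combinebackup and the server recognize that the directory is not
usable as-is without reconstruction.)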
src/backend/access/transam/xlog.c | 97 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 10 +-
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 340 +++-
src/backend/backup/basebackup_incremental.c | 867 ++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1414 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 108 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 46 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 29 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1268 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 +++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 24 +
64 files changed, 8421 insertions(+), 69 deletions(-)
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f38d0eb9a..3e19ec9ad1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3500,6 +3501,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3779,8 +3817,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3823,6 +3861,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5147,9 +5205,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6830,6 +6888,17 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * If there hasn't been much system activity in a while, the WAL
+ * summarizer may be sleeping for relatively long periods, which could
+ * delay an incremental backup that has started concurrently. In the hopes
+ * of avoiding that, poke the WAL summarizer here.
+ *
+ * Possibly this should instead be done at some earlier point in this
+ * function, but it's not clear that it matters much.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7504,6 +7573,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
@@ -8490,8 +8573,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
/*
* Try to parse the directory name as an unsigned integer.
*
- * Tablespace directories should be positive integers that can
- * be represented in 32 bits, with no leading zeroes or trailing
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
* garbage. If we come across a name that doesn't meet those
* criteria, skip it.
*/
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 23461c9d2c..3ad6b679d5 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 4ff4430006..89ddec5bf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
@@ -1340,7 +1346,7 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
- char *endp;
+ char *endp;
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1363,7 +1369,7 @@ read_tablespace_map(List **tablespaces)
ti = palloc0(sizeof(tablespaceinfo));
errno = 0;
ti->oid = strtoul(str, &endp, 10);
- if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
ereport(FATAL,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 64ab54fe06..8aea2a4a76 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -75,22 +78,37 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
+typedef struct
+{
+ const char *filename;
+ pg_checksum_context *checksum_ctx;
+ bbsink *sink;
+ size_t bytes_sent;
+} FileChunkContext;
+
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +120,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +239,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +290,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +313,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +354,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +372,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +634,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +710,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +789,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +990,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1014,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1059,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1134,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1168,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1188,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1148,7 +1196,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isRelationDir = false; /* Does directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
/*
@@ -1182,14 +1231,17 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relfilenumber = InvalidRelFileNumber;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
ForkNumber relForkNum = InvalidForkNumber;
unsigned segno = 0;
bool isRelationFile = false;
@@ -1256,9 +1308,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
char initForkFile[MAXPGPATH];
/*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
*/
snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relfilenumber);
@@ -1332,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1405,27 +1458,79 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ BlockNumber relative_block_numbers[RELSEG_SIZE];
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
- if (sent || sizeonly)
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
@@ -1444,6 +1549,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be set incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1451,7 +1562,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1460,6 +1572,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1492,22 +1605,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
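+ * The header layout is: magic number, block count, truncation block
+ * length, and then the array of relative block numbers; the block
+ * contents follow. GetIncrementalFileSize assumes this same layout.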
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
+
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1690,6 +1892,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
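+ *
+ * For example, sendFile above emits the incremental file header with a
+ * series of calls to this function and then flushes whatever remains
+ * in the buffer afterward.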
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..b70eeb0282
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,867 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
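+ *
+ * For example, if this server's history is timeline 1, then 2, then 3,
+ * and the manifest contains WAL ranges for timelines 1 and 2, we will
+ * conclude that timeline 1 holds the earliest range and timeline 2 the
+ * latest.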
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
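+ *
+ * For example, with hypothetical OIDs, segment 1 of relation file
+ * "base/16384/16385" would be sent as "base/16384/INCREMENTAL.16385.1".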
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid for a shared relation, but spcoid and
+ * relfilenumber should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
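+ *
+ * For example, with the default BLCKSZ of 8192, a full 1GB segment
+ * contains 131072 blocks, so we fall back to sending the whole file
+ * once more than 117964 of its blocks require sending.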
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length) followed by block numbers followed by block contents.
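+ *
+ * For example, with the default BLCKSZ of 8192, an incremental file
+ * containing two blocks occupies 3 * 4 + 2 * (4 + 8192) = 16404 bytes.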
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
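+ *
+ * Filenames encode the TLI and the start and end LSNs as five 8-digit
+ * hex fields. For example, "0000000100000000010000000000000001020304.summary"
+ * covers LSNs 0/1000000 through 0/1020304 on timeline 1.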
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
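+ *
+ * For example, summaries covering [0/100, 0/300) and [0/200, 0/500)
+ * together prove completeness out to 0/500; but if the second summary
+ * instead began at 0/400, we would report 0/300 as the first missing
+ * LSN.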
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != EEXIST || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
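+ *
+ * Expected usage is something like:
+ *
+ *     SELECT * FROM pg_available_wal_summaries();
+ *
+ * which returns one row per summary file, giving its TLI and its start
+ * and end LSNs.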
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
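+ *
+ * For example, one might run:
+ *
+ *     SELECT * FROM pg_wal_summary_contents(1, '0/1000000', '0/1020304');
+ *
+ * Each output row names a relation fork and a block number; rows with
+ * limit_block = true convey a truncation point rather than a modified
+ * block.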
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d7bfb28ff3..8ae8291a3b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -114,6 +114,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -250,6 +251,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -441,6 +443,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -560,6 +563,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1846,6 +1850,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2713,6 +2720,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3066,6 +3075,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3184,6 +3194,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the WAL summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3579,6 +3603,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3729,6 +3759,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3755,6 +3787,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3852,6 +3885,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4073,6 +4107,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5379,6 +5415,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5515,6 +5555,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
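+ *
+ * The summarizer is started only when wal_summarize_mb is nonzero, the
+ * postmaster is in normal running or hot standby state, and nothing
+ * stronger than a smart shutdown has been requested.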
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..926b6c6ae4
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1414 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and
+ * LSN at which the last file ended; in that case, lsn_is_exact is
+ * true. If, however, the LSN is just an approximation, then lsn_is_exact
+ * is false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of
+ * a record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ *
+ * switch_requested can be set to true to notify the summarizer that a
+ * new WAL summary file should be written as soon as possible, without
+ * trying to read more WAL first.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+ bool switch_requested;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+ XLogRecPtr redo_pointer;
+ bool redo_pointer_reached;
+ XLogRecPtr redo_pointer_refresh_lsn;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * one minute (300 * 200 = 60 * 1000).
+ */
+#define MAX_SLEEP_QUANTA 300
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->switch_requested = false;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but seems
+ * reasonable to treat like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered
+ * by a single summary file. If we read a WAL record that ends after
+ * the cutoff LSN computed here, we'll stop the summary. In most cases,
+ * it will actually stop earlier than that, but this is here as a
+ * backstop.
+ */
+ cutoff_lsn = current_lsn + (uint64) wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && end_of_summary_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->switch_requested = false;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if it's
+ * just the beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as requested. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
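+ *
+ * A caller might, hypothetically, use this as follows:
+ *
+ * if (WaitForWalSummarization(end_lsn, 60000) < end_lsn)
+ * ereport(ERROR,
+ * (errmsg("timed out waiting for WAL summarization")));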
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ * If it hasn't, but the in-memory value has reached the target value,
+ * request that a file be written as soon as possible.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (summarized_lsn < lsn &&
+ WalSummarizerCtl->pending_lsn >= lsn)
+ WalSummarizerCtl->switch_requested = true;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /*
+ * Limit the sleep to 1 second, because we may need to request a
+ * switch.
+ */
+ if (remaining_timeout > 1000)
+ remaining_timeout = 1000;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = start_lsn;
+ private_data->redo_pointer_reached =
+ (start_lsn >= private_data->redo_pointer);
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool switch_requested;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /*
+ * We attempt, on a best effort basis only, to make WAL summary file
+ * boundaries line up with checkpoint cycles. So, if the last redo
+ * pointer we've seen was in the future, and this record starts at
+ * that redo pointer, stop before processing and let it be included in
+ * the next summary file.
+ *
+ * Note that in the case of a checkpoint triggered by a backup, the
+ * redo pointer is likely to be pointing to the first record on a
+ * page. Before reading the record, xlogreader->EndRecPtr will have
+ * pointed to the start of the page, which precedes the redo LSN. But
+ * after reading the next record, we'll advance over the page header
+ * and realize that the next record starts at the redo LSN exactly,
+ * making this the first point at which we can realize that it's time
+ * to stop.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->ReadRecPtr >= private_data->redo_pointer)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ default:
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /*
+ * Also update shared memory, and handle any request for a
+ * WAL summary file switch.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ switch_requested = WalSummarizerCtl->switch_requested;
+ LWLockRelease(WALSummarizerLock);
+ if (switch_requested)
+ break;
+
+ /*
+ * Periodically update our notion of the redo pointer, because it
+ * might be changing concurrently. There's no interlocking here: we
+ * might race past the new redo pointer before we learn about it.
+ * That's OK; we only use the redo pointer as a heuristic for where to
+ * stop summarizing.
+ *
+ * It would be nice if we could just fetch the updated redo pointer on
+ * every pass through this loop, but that seems a bit too expensive:
+ * GetRedoRecPtr acquires a heavily-contended spinlock. So, instead,
+ * just fetch the updated value if we've just had to sleep, or if
+ * we've read more than a segment's worth of WAL without sleeping.
+ */
+ if (private_data->waited || xlogreader->EndRecPtr >
+ private_data->redo_pointer_refresh_lsn + wal_segment_size)
+ {
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = xlogreader->EndRecPtr;
+ private_data->redo_pointer_reached =
+ (xlogreader->EndRecPtr >= private_data->redo_pointer);
+ }
+
+ /*
+ * Recheck whether we've just caught up with the redo pointer, and
+ * if so, stop. This has the same purpose as the earlier check for
+ * the same condition above, but there we've just read a record and
+ * might decide against including it in the current summary file,
+ * whereas here we've already included it and might decide against
+ * reading the next one. Note that we may have just refreshed our
+ * notion of the redo pointer, so it's smart to check here before we
+ * do any more work.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->EndRecPtr >= private_data->redo_pointer)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
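+ /*
+ * As an illustration (hypothetical values), a summary on TLI 1
+ * covering 0/01000028 through 0/01FFFFD8 would be named
+ * 0000000100000000010000280000000001FFFFD8.summary.
+ */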
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file; the xlogreader was already freed above. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
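+
+ /*
+ * Illustrative arithmetic: starting from one quantum (200ms), about
+ * nine consecutive idle wakeups are needed to reach the 60-second cap,
+ * since repeated doubling from 1 first exceeds MAX_SLEEP_QUANTA (300)
+ * at 512 quanta; a sufficiently large burst of page reads drops the
+ * sleep time straight back to the minimum.
+ */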
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove old WAL summary files, if removal is enabled. A summary file is
+ * removed only once the corresponding WAL is gone and the file is older
+ * than the configured retention time; we check at most once per
+ * checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * (time_t) wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the summary
+ * file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 80374c55be..a73d15fdd5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
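+ *
+ * Protocol sketch, as implemented below: we reply with a CopyInResponse;
+ * the client then streams the manifest as CopyData ('d') messages and
+ * finishes with CopyDone ('c'), after which the accumulated manifest is
+ * parsed and retained for use by a later BASE_BACKUP command.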
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in copy-in mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 5551afffc0..ff0660656c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, SnapMgrShmemSize());
@@ -292,6 +294,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 811ad94742..4a315bfe93 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d422..bd0a921a3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -292,7 +292,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -314,6 +315,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 13774254d2..2ea39fd824 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ WAIT_EVENT_RECOVERY_WAL_STREAM RecoveryWalStream "Waiting in main loop of startu
WAIT_EVENT_SYSLOGGER_MAIN SysLoggerMain "Waiting in main loop of syslogger process."
WAIT_EVENT_WAL_RECEIVER_MAIN WalReceiverMain "Waiting in main loop of WAL receiver process."
WAIT_EVENT_WAL_SENDER_MAIN WalSenderMain "Waiting in main loop of WAL sender process."
+WAIT_EVENT_WAL_SUMMARIZER_WAL WalSummarizerWal "Waiting in WAL summarizer for more WAL to be generated."
WAIT_EVENT_WAL_WRITER_MAIN WalWriterMain "Waiting in main loop of WAL writer process."
@@ -140,6 +141,7 @@ WAIT_EVENT_SAFE_SNAPSHOT SafeSnapshot "Waiting to obtain a valid snapshot for a
WAIT_EVENT_SYNC_REP SyncRep "Waiting for confirmation from a remote server during synchronous replication."
WAIT_EVENT_WAL_RECEIVER_EXIT WalReceiverExit "Waiting for the WAL receiver to exit."
WAIT_EVENT_WAL_RECEIVER_WAIT_START WalReceiverWaitStart "Waiting for startup process to send initial data for streaming replication."
+WAIT_EVENT_WAL_SUMMARY_READY WalSummaryReady "Waiting for a new WAL summary to be generated."
WAIT_EVENT_XACT_GROUP_UPDATE XactGroupUpdate "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -160,6 +162,7 @@ WAIT_EVENT_REGISTER_SYNC_REQUEST RegisterSyncRequest "Waiting while sending sync
WAIT_EVENT_SPIN_DELAY SpinDelay "Waiting while acquiring a contended spinlock."
WAIT_EVENT_VACUUM_DELAY VacuumDelay "Waiting in a cost-based vacuum delay point."
WAIT_EVENT_VACUUM_TRUNCATE VacuumTruncate "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAIT_EVENT_WAL_SUMMARIZER_ERROR WalSummarizerError "Waiting after a WAL summarizer error."
#
@@ -241,6 +244,8 @@ WAIT_EVENT_WAL_COPY_WRITE WALCopyWrite "Waiting for a write when creating a new
WAIT_EVENT_WAL_INIT_SYNC WALInitSync "Waiting for a newly initialized WAL file to reach durable storage."
WAIT_EVENT_WAL_INIT_WRITE WALInitWrite "Waiting for a write while initializing a new WAL file."
WAIT_EVENT_WAL_READ WALRead "Waiting for a read from a WAL file."
+WAIT_EVENT_WAL_SUMMARY_READ WALSummaryRead "Waiting for a read from a WAL summary file."
+WAIT_EVENT_WAL_SUMMARY_WRITE WALSummaryWrite "Waiting for a write to a WAL summary file."
WAIT_EVENT_WAL_SYNC WALSync "Waiting for a WAL file to reach durable storage."
WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN WALSyncMethodAssign "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAIT_EVENT_WAL_WRITE WALWrite "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 1e671c560c..037111b89f 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e565a3092f..91dc345d8b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -61,6 +61,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -702,6 +703,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3169,6 +3172,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c768af9a73..1211de5ea3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -301,6 +301,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 905b979947..09d153ed88 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -226,6 +226,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1dc8efe0cb..3ffe15ac74 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with v16.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 160000
+
/*
* Different ways to include WAL
*/
@@ -216,7 +221,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -684,6 +690,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1724,7 +1747,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1790,7 +1815,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* We are done reading the manifest; close the file. */
+ close(fd);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1897,6 +1989,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2252,6 +2345,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2288,6 +2382,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2312,7 +2407,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2347,6 +2442,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2756,7 +2854,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..cb20480aae
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,46 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
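+ /*
+ * Copy the backup_label contents line by line, omitting the INCREMENTAL
+ * FROM lines and feeding everything we do write into the checksum.
+ */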
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
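+ /* Scan forward for a newline; if none is found, eo stops at buf->len. */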
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
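+ /*
+ * Temporarily NUL-terminate the line at e so that sscanf() cannot read
+ * past it; the saved byte is restored immediately afterward.
+ */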
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
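+ /* Same temporary NUL-termination trick as in parse_lsn(). */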
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
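+ /*
+ * Read the source a chunk at a time, writing each chunk to the
+ * destination and folding it into the checksum as we go.
+ */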
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hexadecimal characters, whereas a CRC-32C value is only
+ * 8, and there might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..bea0db405e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,29 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..6c7fd3290e
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1268 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ bool progress;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"output", required_argument, NULL, 'o'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"progress", no_argument, NULL, 'P'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'P':
+ opt.progress = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ fsync_pgdata(opt.output, version * 10000);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
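+ /*
+ * check_tli and check_lsn track the "INCREMENTAL FROM" position found in
+ * the backup examined on the previous (later) iteration; each older
+ * backup must start at exactly that position.
+ */
+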
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -P, --progress show progress information\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ uint64 oid; /* wider than Oid so that overflow is detectable */
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || *s == '0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If processing is a user-defined tablespace, the tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
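+ *
+ * For example (hypothetical values): when three backups are being
+ * combined, the toplevel call for the main data directory passes
+ * tsoid = InvalidOid, relative_path = NULL, and n_prior_backups = 2,
+ * with prior_backup_dirs naming the two older backup directories.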
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno != 0)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", ifulldir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used a multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old", filename);
+ pg_fatal("%s: could not parse version number", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno != 0)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
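+
+/*
+ * For orientation, a sketch of the incremental file layout implied by
+ * make_incremental_rfile() and the offsetmap arithmetic below (not an
+ * authoritative format definition):
+ *
+ * magic number (INCREMENTAL_MAGIC)
+ * number of blocks stored in this file
+ * truncation block length
+ * array of relative block numbers, one per stored block
+ * the stored blocks themselves, BLCKSZ bytes each, in array order
+ *
+ * Hence the i'th stored block's image begins at header_length + i * BLCKSZ.
+ */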
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
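+
+ /*
+ * A worked example with hypothetical numbers: if truncation_block_length
+ * is 4 and this incremental file stores blocks 2 and 6, then block 2
+ * reduces num_missing_blocks from 4 to 3 (blocks 0, 1, and 3 must still
+ * be found in older backups), while block 6 lies beyond the truncation
+ * point, so it does not affect the count; blocks 4 and 5 will simply be
+ * zero-filled unless some other source supplies them.
+ */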
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ pfree(s);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
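+ *
+ * For example (hypothetical numbers): with truncation_block_length = 10
+ * and relative block numbers {3, 12}, the result is 13, since block 12
+ * extends the file beyond the truncation point.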
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif /* RECONSTRUCT_H */
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
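+
+/*
+ * For illustration, the overall shape of the manifest this writer emits
+ * (values here are hypothetical; add_file_to_manifest() and
+ * finalize_manifest() below define the actual output):
+ *
+ * { "PostgreSQL-Backup-Manifest-Version": 1,
+ * "Files": [
+ * { "Path": "global/pg_control", "Size": 8192, "Last-Modified": "...",
+ * "Checksum-Algorithm": "CRC32C", "Checksum": "..." } ],
+ * "WAL-Ranges": [
+ * { "Timeline": 1, "Start-LSN": "0/2000028", "End-LSN": "0/2000100" } ],
+ * "Manifest-Checksum": "..."}
+ */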
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 25ecdaaa15..15a40cd17e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -86,6 +86,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -492,6 +493,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1033,6 +1035,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
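+
+ /*
+ * Summary file names consist of a timeline ID (8 hex characters) plus
+ * the start and end LSNs of the summarized WAL range (16 hex characters
+ * each), 40 hex characters in all, followed by a ".summary" suffix.
+ */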
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index e4cd26762b..ef38cc2f03 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS_COMMON = \
archive.o \
base64.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
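+ *
+ * For example (hypothetical numbers): if the summarized WAL truncates a
+ * fork to 100 blocks and a later record modifies block 105, the entry
+ * ends up with limit block 100 and block 105 marked modified; blocks
+ * 100-104 need not be sourced from any older backup, since they did not
+ * survive the truncation.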
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
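+
+/*
+ * To make the arithmetic concrete (a worked example, not additional
+ * machinery): block number B belongs to chunk B / BLOCKS_PER_CHUNK, at
+ * offset B % BLOCKS_PER_CHUNK within that chunk. A chunk converts to a
+ * bitmap when it reaches MAX_ENTRIES_PER_CHUNK = 65536 / 16 = 4096
+ * entries, at which point the two representations are the same size:
+ * 4096 uint16s, i.e. 8kB, i.e. one bit per block in the chunk.
+ */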
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * table reference file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare the buffer, initialize the CRC, and save the callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for writing to the underlying file,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index a9ff7f9db8..dda018f6d1 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
common_sources = files(
'archive.c',
'base64.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 48ca852381..fed5d790cc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -206,6 +206,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9805bc6118..148f35d9b1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12062,4 +12062,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..898adccb25 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -443,6 +444,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -455,6 +457,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c72ba0944..353db33a9f 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 33e50ad933..6ba5eca700 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 480e6d6caa..a91437dfa7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 49a33c0387..56b8270dda 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3989,3 +3989,27 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+FileChunkContext
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
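To illustrate how the pieces in blkreftable.h fit together, here is a
rough sketch of a consumer of the reader API. The callbacks and the
walk_summary_file() wrapper are hypothetical placeholders, not part of
the patch; a real caller would supply whatever io_callback_fn and
report_error_fn make sense for it.

#include "postgres_fe.h"

#include <unistd.h>

#include "common/blkreftable.h"
#include "common/logging.h"

/* Hypothetical read callback: pull bytes from a file descriptor. */
static int
read_from_fd(void *cb_arg, void *data, int length)
{
	int			rc = read(*(int *) cb_arg, data, length);

	if (rc < 0)
		pg_fatal("could not read file: %m");	/* must not return */
	return rc;
}

/* Hypothetical error callback; per the API contract, it must not return. */
static void
fatal_error_cb(void *cb_arg, char *fmt,...)
{
	/* format fmt and the varargs, print them, then bail out */
	exit(1);
}

static void
walk_summary_file(int fd, char *filename)
{
	BlockRefTableReader *reader;
	RelFileLocator rlocator;
	ForkNumber	forknum;
	BlockNumber limit_block;
	BlockNumber blocks[256];
	unsigned	nblocks;
	unsigned	i;

	reader = CreateBlockRefTableReader(read_from_fd, &fd, filename,
									   fatal_error_cb, NULL);
	while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
										   &limit_block))
	{
		/* Blocks >= limit_block must also be treated as modified. */
		while ((nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
													   lengthof(blocks))) > 0)
		{
			for (i = 0; i < nblocks; ++i)
			{
				/* blocks[i] of (rlocator, forknum) was modified */
			}
		}
	}
	DestroyBlockRefTableReader(reader);
}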
Hi Robert,
On 8/30/23 10:49, Robert Haas wrote:
In the limited time that I've had to work on this project lately, I've
been trying to come up with a test case for this feature -- and since
I've gotten completely stuck, I thought it might be time to post and
see if anyone else has a better idea. I thought a reasonable test case
would be: Do a full backup. Change some stuff. Do an incremental
backup. Restore both backups and perform replay to the same LSN. Then
compare the files on disk. But I cannot make this work. The first
problem I ran into was that replay of the full backup does a
restartpoint, while the replay of the incremental backup does not.
That results in, for example, pg_subtrans having different contents.
pg_subtrans, at least, can be ignored since it is excluded from the
backup and not required for recovery.
I'm not sure whether it can also result in data files having different
contents: are changes that we replayed following the last restartpoint
guaranteed to end up on disk when the server is shut down? It wasn't
clear to me that this is the case. I thought maybe I could get both
servers to perform a restartpoint at the same location by shutting
down the primary and then replaying through the shutdown checkpoint,
but that doesn't work because the primary doesn't finish archiving
before shutting down. After some more fiddling I settled (at least for
research purposes) on having the restored backups PITR and promote,
instead of PITR and pause, so that we're guaranteed a checkpoint. But
that just caused me to run into a far worse problem: replay on the
standby doesn't actually create a state that is byte-for-byte
identical to the one that exists on the primary. I quickly discovered
that in my test case, I was ending up with different contents in the
"hole" of a block wherein a tuple got updated. Replay doesn't think
it's important to make the hole end up with the same contents on all
machines that replay the WAL, so I end up with one server that has
more junk in there than the other one and the tests fail.
This is pretty much what I discovered when investigating backup from
standby back in 2016. My (ultimately unsuccessful) efforts to find a
clean delta resulted in [1] as I systematically excluded directories
that are not required for recovery and will not be synced between a
primary and standby.
FWIW Heikki also made similar attempts at this before me (back then I
found the thread but I doubt I could find it again) and arrived at
similar results. We discussed this in person and figured out that we had
come to more or less the same conclusion. Welcome to the club!
Unless someone has a brilliant idea that I lack, this suggests to me
that this whole line of testing is a dead end. I can, of course, write
tests that compare clusters *logically* -- do the correct relations
exist, are they accessible, do they have the right contents? But I
feel like it would be easy to have bugs that escape detection in such
a test but would be detected by a physical comparison of the clusters.
Agreed, though a matching logical result is still very compelling.
However, such a comparison can only be conducted if either (a) there's
some way to set up the test so that byte-for-byte identical clusters
can be expected or (b) there's some way to perform the comparison that
can distinguish between expected, harmless differences and unexpected,
problematic differences. And at the moment my conclusion is that
neither (a) nor (b) exists. Does anyone think otherwise?
I do not. My conclusion back then was that validating a physical
comparison would be nearly impossible without changes to Postgres to
make the primary and standby match via replication. Which, by the way, I
still think would be a great idea. In principle, at least. Replay is
already a major bottleneck and anything that makes it slower will likely
not be very popular.
This would also be great for WAL, since the last time I tested, the same
WAL segment could differ between the primary and standby because the
unused (recycled) portion at the end is not zeroed on the standby as it
is on the primary (though logically they do match). I would be very happy
if somebody told me that my info is out of date here and this has been
fixed, but when I looked at the code it was incredibly tricky to do
because of how WAL is replicated.
Meanwhile, here's a rebased set of patches. The somewhat-primitive
attempts at writing tests are in 0009, but they don't work, for the
reasons explained above. I think I'd probably like to go ahead and
commit 0001 and 0002 soon if there are no objections, since I think
those are good refactorings independently of the rest of this.
No objections to 0001/0002.
Regards,
-David
[1]: http://git.postgresql.org/pg/commitdiff/6ad8ac6026287e3ccbc4d606b6ab6116ccc0eec8
Hey, thanks for the reply.
On Thu, Aug 31, 2023 at 6:50 PM David Steele <david@pgmasters.net> wrote:
pg_subtrans, at least, can be ignored since it is excluded from the
backup and not required for recovery.
I agree...
Welcome to the club!
Thanks for the welcome, but being a member feels *terrible*. :-)
I do not. My conclusion back then was that validating a physical
comparison would be nearly impossible without changes to Postgres to
make the primary and standby match via replication. Which, by the way, I
still think would be a great idea. In principle, at least. Replay is
already a major bottleneck and anything that makes it slower will likely
not be very popular.
Fair point. But maybe the bigger issue is the work involved. I don't
think zeroing the hole in all cases would likely be that expensive,
but finding everything that can cause the standby to diverge from the
primary and fixing all of it sounds like an unpleasant amount of
effort. Still, it's good to know that I'm not missing something
obvious.
No objections to 0001/0002.
Cool.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
Unless someone has a brilliant idea that I lack, this suggests to me
that this whole line of testing is a dead end. I can, of course, write
tests that compare clusters *logically* -- do the correct relations
exist, are they accessible, do they have the right contents?
Can't we think of comparing at the block level, like we can compare
each block but ignore the content of the hole?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 4, 2023 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Can't we think of comparing at the block level, like we can compare
each block but ignore the content of the hole?
We could do that, but I don't think that's a full solution. I think
I'd end up having to reimplement the equivalent of heap_mask,
btree_mask, et al. in Perl, which doesn't seem very reasonable. It's
fairly complicated logic even written in C, and doing the right thing
in Perl would be more complex, I think, because it wouldn't have
access to all the same #defines, which depend on things like word size
and endianness and stuff. If we want to allow this sort of comparison,
I feel we should think of changing the C code in some way to make it
work reliably rather than try to paper over the problems in Perl.
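To make that concrete, the simplest piece of such normalization, masking
the hole, looks roughly like this in C. This is a sketch loosely modeled
on the buffer-masking helpers that wal_consistency_checking uses
(src/backend/access/common/bufmask.c); the function names here are
illustrative, and the real masking logic is per-AM and also covers hint
bits, unused line pointers, and more.

#include "postgres.h"

#include "storage/bufpage.h"

/* Zero the hole between pd_lower and pd_upper. */
static void
mask_hole(Page page)
{
	PageHeader	phdr = (PageHeader) page;

	if (phdr->pd_lower < phdr->pd_upper)
		memset((char *) page + phdr->pd_lower, 0,
			   phdr->pd_upper - phdr->pd_lower);
}

/* Compare two pages while ignoring only the hole contents. */
static bool
pages_match_ignoring_hole(Page a, Page b)
{
	mask_hole(a);
	mask_hole(b);
	return memcmp(a, b, BLCKSZ) == 0;
}

Reimplementing that faithfully in Perl, for every access method, is the
part that doesn't seem reasonable.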
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
Meanwhile, here's a rebased set of patches. The somewhat-primitive
attempts at writing tests are in 0009, but they don't work, for the
reasons explained above. I think I'd probably like to go ahead and
commit 0001 and 0002 soon if there are no objections, since I think
those are good refactorings independently of the rest of this.
I have started reading the patch today; I haven't yet completed one
pass, but here are my comments on 0007:
1.
+ BlockNumber relative_block_numbers[RELSEG_SIZE];
With RELSEG_SIZE at its default of 131072, this is 512kB of memory, so I
think it would be better to palloc it instead of keeping it on the stack.
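That is, something along these lines (sketch only, with the surrounding
code elided):

	BlockNumber *relative_block_numbers;

	relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
	/* ... fill and use the array as before ... */
	pfree(relative_block_numbers);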
2.
/*
* Try to parse the directory name as an unsigned integer.
*
- * Tablespace directories should be positive integers that can
- * be represented in 32 bits, with no leading zeroes or trailing
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
* garbage. If we come across a name that doesn't meet those
* criteria, skip it.
Unrelated code refactoring hunk
3.
+typedef struct
+{
+ const char *filename;
+ pg_checksum_context *checksum_ctx;
+ bbsink *sink;
+ size_t bytes_sent;
+} FileChunkContext;
This structure is not used anywhere.
4.
+ * If the file is to be set incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
/If the file is to be set incrementally/If the file is to be sent incrementally
5.
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
I do not really like this change, because after removing this you have
put 2 independent checks for sending the full file[1] and sending it
incrementally[2]. Actually for sending incrementally
'statbuf->st_size' is computed from the 'num_incremental_blocks'
itself so why don't we keep this breaking condition in the while loop
itself? So that we can avoid these two separate conditions.
[1]
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
[2]
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
6.
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
Better we add some comments for these structures.
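A minimal sketch of the change suggested in point 1, under the
assumption that the array's lifetime is a single call (names as in
the patch):

BlockNumber *relative_block_numbers;

/* Move the RELSEG_SIZE-element array from the stack to the heap. */
relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/* ... fill and consume the array while sending the file ... */
pfree(relative_block_numbers);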
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 30, 2023 at 4:50 PM Robert Haas <robertmhaas@gmail.com> wrote:
[..]
I've played a little bit more with this second batch of patches on
e8d74ad625f7344f6b715254d3869663c1569a51 @ 31Aug (days before the
wait events refactor):
test_across_wallevelminimal.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_truncaterollback.sh
test_unlogged_table.sh
all those basic tests had GOOD results. Please find attached. I'll
try to schedule some more realistic (in terms of workload and sizes)
tests in a couple of days + maybe have some fun with
cross-backup-and-restore across standbys. As for the earlier doubt:
the wal_level = minimal situation shouldn't be a concern, because it
requires max_wal_senders = 0, while pg_basebackup requires it to be
above 0 (due to "FATAL: number of requested standby connections
exceeds max_wal_senders (currently 0)").
I also wanted to introduce corruption into the WAL summary files, but
later saw in the code that they are already covered by CRC32, cool.
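For anyone else wanting to poke at that: a sketch of the usual
backend CRC pattern (the macros are from src/include/port/pg_crc32c.h;
the notion of a separate "payload" and stored CRC here is illustrative,
not the patch's actual summary-file layout):

#include "port/pg_crc32c.h"

static bool
summary_payload_crc_ok(const char *data, size_t len, pg_crc32c stored_crc)
{
    pg_crc32c   crc;

    /* Compute CRC-32C over the payload and compare to the stored value. */
    INIT_CRC32C(crc);
    COMP_CRC32C(crc, data, len);
    FIN_CRC32C(crc);

    return EQ_CRC32C(crc, stored_crc);
}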
In v07:
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 160000
170000 ?
A related design question is whether we should really be sending the
whole backup manifest to the server at all. If it turns out that we
don't really need anything except for the LSN of the previous backup,
we could send that one piece of information instead of everything. On
the other hand, if we need the list of files from the previous backup,
then sending the whole manifest makes sense.
If that is still an area open for discussion: wouldn't it be better
to just specify the LSN, as that would allow resyncing a standby
across a major lag where the WAL to replay would be enormous? Given a
primary->standby pair where the standby is stuck at some LSN, right
now it would be:
1) calculate the backup manifest of the desynced 10TB standby (how?
using which tool?) - even if possible, that means reading 10TB of
data instead of just supplying a number, doesn't it?
2) back up the primary with such an incremental backup >= that LSN
3) copy the incremental backup to the standby
4) apply it to the impaired standby
5) restart the WAL replay
- We only know how to operate on directories, not tar files. I thought
about that when working on pg_verifybackup as well, but I didn't do
anything about it. It would be nice to go back and make that tool work
on tar-format backups, and this one, too. I don't think there would be
a whole lot of point trying to operate on compressed tar files because
you need random access and that seems hard on a compressed file, but
on uncompressed files it seems at least theoretically doable. I'm not
sure whether anyone would care that much about this, though, even
though it does sound pretty cool.
Also, maybe it's too early to ask, but wouldn't it be nice if we
could have a future option in pg_combinebackup to avoid double writes
when used from restore hosts? Right now we need to first reconstruct
the original datadir from the full and incremental backups on the
host hosting the backups, and then TRANSFER it again to the target
host. So something like this could work well from the restore host:
pg_combinebackup /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar
-o - | ssh dbserver 'tar -C /path/to/restored/cluster -xvf -'. The
bad thing is that such a pipe prevents parallelism from day 1, and
I'm afraid I do not have a better easy idea on how to have both at
the same time in the long term.
-J.
On Fri, Sep 1, 2023 at 10:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
No objections to 0001/0002.
Cool.
Nobody else objected either, so I went ahead and committed those. I'll
rebase the rest of the patches on top of the latest master and repost,
hopefully after addressing some of the other review comments from
Dilip and Jakub.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Sep 12, 2023 at 5:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
+ BlockNumber relative_block_numbers[RELSEG_SIZE];
This is close to 400kB of memory, so I think it is better we palloc it
instead of keeping it in the stack.
Fixed.
Unrelated code refactoring hunk
Fixed.
This structure is not used anywhere.
Removed.
/If the file is to be set incrementally/If the file is to be sent incrementally
Fixed.
I do not really like this change, because after removing this you have
put 2 independent checks for sending the full file[1] and sending it
incrementally[2]. Actually for sending incrementally
'statbuf->st_size' is computed from the 'num_incremental_blocks'
itself so why don't we keep this breaking condition in the while loop
itself? So that we can avoid these two separate conditions.
I don't think that would be correct. The number of bytes that need to
be read from the original file is not equal to the number of bytes
that will be written to the incremental file. Admittedly, they're
currently different by less than a block, but that could change if we
change the format of the incremental file (e.g. suppose we compressed
the blocks in the incremental file with gzip, or smushed out the holes
in the pages). I wrote the loop as I did precisely so that the two
cases could have different loop exit conditions.
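Schematically (this is the shape of the loop, not the exact patch
code; the variable names are the ones quoted above):

while (1)
{
    if (incremental_blocks == NULL)
    {
        /* Full-file case: stop when the whole source file is read. */
        if (bytes_done >= statbuf->st_size)
            break;
    }
    else
    {
        /* Incremental case: stop when all requested blocks are sent. */
        if (ibindex >= num_incremental_blocks)
            break;
    }

    /* ... read the next chunk or block and push it to the sink ... */
}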
Better we add some comments for these structures.
Done.
Here's a new patch set, also addressing Jakub's observation that
MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v3-0002-Refactor-parse_filename_for_nontemp_relation-to-p.patch
From b7f1acaead7fdf87a01ea88a7a381580c1705b38 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:44 -0400
Subject: [PATCH v3 2/7] Refactor parse_filename_for_nontemp_relation to parse
more.
Instead of returning the number of characters in the RelFileNumber,
return the RelFileNumber itself. Continue to return the fork number,
as before, and additionally return the segment number.
parse_filename_for_nontemp_relation now rejects a RelFileNumber or
segment number that begins with a leading zero. Before, we accepted
such cases as relation filenames, but if we continued to do so after
this change, the function might return the same values for two
different files (e.g. 1234.5 and 001234.5 or 1234.005) which could be
annoying for callers. Since we don't actually ever generate filenames
with leading zeroes in the names, any such files that we find must
have been created by something other than PostgreSQL, and it is
therefore reasonable to treat them as non-relation files.
Along the way, change unlogged_relation_entry to store a RelFileNumber
rather than an OID. This update should have been made in
851f4cc75cdd8c831f1baa9a7abf8c8248b65890, but it was overlooked.
It could be done separately from the rest of this commit, but that
would be more involved, whereas this way it's a 1-line change.
---
src/backend/backup/basebackup.c | 15 ++--
src/backend/storage/file/reinit.c | 137 ++++++++++++++++++------------
src/include/storage/reinit.h | 5 +-
3 files changed, 93 insertions(+), 64 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index dc6892011d..b537f46219 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1198,9 +1198,9 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
{
int excludeIdx;
bool excludeFound;
- ForkNumber relForkNum; /* Type of fork if file is a relation */
- int relnumchars; /* Chars in filename that are the
- * relnumber */
+ RelFileNumber relNumber;
+ ForkNumber relForkNum;
+ unsigned segno;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1250,23 +1250,20 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &relForkNum, &segno))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
{
char initForkFile[MAXPGPATH];
- char relNumber[OIDCHARS + 1];
/*
* If any other type of fork, check if there is an init fork
* with the same RelFileNumber. If so, the file can be
* excluded.
*/
- memcpy(relNumber, de->d_name, relnumchars);
- relNumber[relnumchars] = '\0';
- snprintf(initForkFile, sizeof(initForkFile), "%s/%s_init",
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relNumber);
if (lstat(initForkFile, &statbuf) == 0)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..5df2517b46 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
- Oid reloid; /* hash key */
+ RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
/*
@@ -195,12 +195,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -208,10 +209,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the RelFileNumber into the hash table, if it isn't already.
*/
- ent.reloid = atooid(de->d_name);
(void) hash_search(hash, &ent, HASH_ENTER, NULL);
}
@@ -235,12 +234,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* We never remove the init fork. */
@@ -251,7 +251,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
if (hash_search(hash, &ent, HASH_FIND, NULL))
{
snprintf(rm_path, sizeof(rm_path), "%s/%s",
@@ -285,14 +284,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ RelFileNumber relNumber;
+ unsigned segno;
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -304,11 +303,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspacedirname, de->d_name);
/* Construct destination pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(dstpath, sizeof(dstpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
/* OK, we're ready to perform the actual copy. */
elog(DEBUG2, "copying %s to %s", srcpath, dstpath);
@@ -327,14 +327,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
+ RelFileNumber relNumber;
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ unsigned segno;
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -342,11 +342,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/* Construct main fork pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(mainpath, sizeof(mainpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(mainpath, sizeof(mainpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
fsync_fname(mainpath, false);
}
@@ -371,52 +372,82 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* This function returns true if the file appears to be in the correct format
* for a non-temporary relation and false otherwise.
*
- * NB: If this function returns true, the caller is entitled to assume that
- * *relnumchars has been set to a value no more than OIDCHARS, and thus
- * that a buffer of OIDCHARS+1 characters is sufficient to hold the
- * RelFileNumber portion of the filename. This is critical to protect against
- * a possible buffer overrun.
+ * If it returns true, it sets *relnumber, *fork, and *segno to the values
+ * extracted from the filename. If it returns false, these values are set to
+ * InvalidRelFileNumber, InvalidForkNumber, and 0, respectively.
*/
bool
-parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
+ ForkNumber *fork, unsigned *segno)
{
- int pos;
+ unsigned long n,
+ s;
+ ForkNumber f;
+ char *endp;
- /* Look for a non-empty string of digits (that isn't too long). */
- for (pos = 0; isdigit((unsigned char) name[pos]); ++pos)
- ;
- if (pos == 0 || pos > OIDCHARS)
+ *relnumber = InvalidRelFileNumber;
+ *fork = InvalidForkNumber;
+ *segno = 0;
+
+ /*
+ * Relation filenames should begin with a digit that is not a zero. By
+ * rejecting cases involving leading zeroes, the caller can assume that
+ * there's only one possible string of characters that could have produced
+ * any given value for *relnumber.
+ *
+ * (To be clear, we don't expect files with names like 0017.3 to exist at
+ * all -- but if 0017.3 does exist, it's a non-relation file, not part of
+ * the main fork for relfilenode 17.)
+ */
+ if (name[0] < '1' || name[0] > '9')
+ return false;
+
+ /*
+ * Parse the leading digit string. If the value is out of range, we
+ * conclude that this isn't a relation file at all.
+ */
+ errno = 0;
+ n = strtoul(name, &endp, 10);
+ if (errno || name == endp || n <= 0 || n > PG_UINT32_MAX)
return false;
- *relnumchars = pos;
+ name = endp;
/* Check for a fork name. */
- if (name[pos] != '_')
- *fork = MAIN_FORKNUM;
+ if (*name != '_')
+ f = MAIN_FORKNUM;
else
{
int forkchar;
- forkchar = forkname_chars(&name[pos + 1], fork);
+ forkchar = forkname_chars(name + 1, &f);
if (forkchar <= 0)
return false;
- pos += forkchar + 1;
+ name += forkchar + 1;
}
/* Check for a segment number. */
- if (name[pos] == '.')
+ if (*name != '.')
+ s = 0;
+ else
{
- int segchar;
+ /* Reject leading zeroes, just like we do for RelFileNumber. */
+ if (name[1] < '1' || name[1] > '9')
+ return false;
- for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
- ;
- if (segchar <= 1)
+ errno = 0;
+ s = strtoul(name + 1, &endp, 10);
+ if (errno || name + 1 == endp || s <= 0 || s > PG_UINT32_MAX)
return false;
- pos += segchar;
+ name = endp;
}
/* Now we should be at the end. */
- if (name[pos] != '\0')
+ if (*name != '\0')
return false;
+
+ /* Set out parameters and return. */
+ *relnumber = (RelFileNumber) n;
+ *fork = f;
+ *segno = (unsigned) s;
return true;
}
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..f8eb7ce234 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,8 +20,9 @@
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
- int *relnumchars,
- ForkNumber *fork);
+ RelFileNumber *relnumber,
+ ForkNumber *fork,
+ unsigned *segno);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
--
2.37.1 (Apple Git-137.1)
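As an aside, a hypothetical call to the refactored function might
look like this (the filename is made up; per the commit message,
leading zeroes or trailing garbage make it return false):

RelFileNumber relnumber;
ForkNumber  fork;
unsigned    segno;

if (parse_filename_for_nontemp_relation("16384_fsm.2",
                                        &relnumber, &fork, &segno))
{
    /* relnumber == 16384, fork == FSM_FORKNUM, segno == 2 */
}
else
{
    /* e.g. "016384", "16384.0", or "t3_16384" would all land here */
}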
v3-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
From 333933d0d14588a43826e0089d97579c65cef94e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v3 4/7] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index cc5c54dcee..ff60666f5c 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 3b97497d1a..fcc0c4fe8d 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 2379f7be7b..672e8bcf25 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v3-0001-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch
From fcc50ccfc814763e7b12dfa3b265ee737802b74f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:16 -0400
Subject: [PATCH v3 1/7] Change struct tablespaceinfo's oid member from 'char
*' to 'Oid'
This shouldn't change behavior except in the unusual case where
there are files in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, or the value is actually zero, or because
the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
---
src/backend/access/transam/xlog.c | 19 ++++++++++--
src/backend/access/transam/xlogrecovery.c | 12 ++++++--
src/backend/backup/backup_manifest.c | 6 ++--
src/backend/backup/basebackup.c | 35 ++++++++++++-----------
src/backend/backup/basebackup_copy.c | 2 +-
src/include/backup/backup_manifest.h | 2 +-
src/include/backup/basebackup.h | 2 +-
7 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..677a5bf51b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8483,9 +8483,22 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
char *relpath = NULL;
char *s;
PGFileType de_type;
+ char *badp;
+ Oid tsoid;
- /* Skip anything that doesn't look like a tablespace */
- if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+ /*
+ * Try to parse the directory name as an unsigned integer.
+ *
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
+ * garbage. If we come across a name that doesn't meet those
+ * criteria, skip it.
+ */
+ if (de->d_name[0] < '1' || de->d_name[0] > '9')
+ continue;
+ errno = 0;
+ tsoid = strtoul(de->d_name, &badp, 10);
+ if (*badp != '\0' || errno == EINVAL || errno == ERANGE)
continue;
snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
@@ -8560,7 +8573,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
}
ti = palloc(sizeof(tablespaceinfo));
- ti->oid = pstrdup(de->d_name);
+ ti->oid = tsoid;
ti->path = pstrdup(linkpath);
ti->rpath = relpath;
ti->size = -1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..5549e1afc5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -678,7 +678,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
tablespaceinfo *ti = lfirst(lc);
char *linkloc;
- linkloc = psprintf("pg_tblspc/%s", ti->oid);
+ linkloc = psprintf("pg_tblspc/%u", ti->oid);
/*
* Remove the existing symlink if any and Create the symlink
@@ -692,7 +692,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not create symbolic link \"%s\": %m",
linkloc)));
- pfree(ti->oid);
pfree(ti->path);
pfree(ti);
}
@@ -1341,6 +1340,8 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
+ char *endp;
+
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1360,7 +1361,12 @@ read_tablespace_map(List **tablespaces)
str[n++] = '\0';
ti = palloc0(sizeof(tablespaceinfo));
- ti->oid = pstrdup(str);
+ errno = 0;
+ ti->oid = strtoul(str, &endp, 10);
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
ti->path = pstrdup(str + n);
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index cee6216524..aeed362a9a 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -97,7 +97,7 @@ FreeBackupManifest(backup_manifest_info *manifest)
* Add an entry to the backup manifest for a file.
*/
void
-AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
+AddFileToBackupManifest(backup_manifest_info *manifest, Oid spcoid,
const char *pathname, size_t size, pg_time_t mtime,
pg_checksum_context *checksum_ctx)
{
@@ -114,9 +114,9 @@ AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
* pathname relative to the data directory (ignoring the intermediate
* symlink traversal).
*/
- if (spcoid != NULL)
+ if (OidIsValid(spcoid))
{
- snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%s/%s", spcoid,
+ snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%u/%s", spcoid,
pathname);
pathname = pathbuf;
}
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 7d025bcf38..dc6892011d 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -75,14 +75,15 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
-static int64 sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
struct backup_manifest_info *manifest);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, const char *spcoid);
+ backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid);
+ struct stat *statbuf, bool missing_ok,
+ Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
@@ -305,7 +306,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, NULL);
+ true, NULL, InvalidOid);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
NULL);
@@ -346,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, NULL);
+ sendtblspclinks, &manifest, InvalidOid);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -355,11 +356,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ false, InvalidOid, InvalidOid, &manifest);
}
else
{
- char *archive_name = psprintf("%s.tar", ti->oid);
+ char *archive_name = psprintf("%u.tar", ti->oid);
bbsink_begin_archive(sink, archive_name);
@@ -623,8 +624,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(sink, pathbuf, pathbuf, &statbuf, false, InvalidOid,
- &manifest, NULL);
+ sendFile(sink, pathbuf, pathbuf, &statbuf, false,
+ InvalidOid, InvalidOid, &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1087,7 +1088,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
_tarWritePadding(sink, len);
- AddFileToBackupManifest(manifest, NULL, filename, len,
+ AddFileToBackupManifest(manifest, InvalidOid, filename, len,
(pg_time_t) statbuf.st_mtime, &checksum_ctx);
}
@@ -1099,7 +1100,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
* Only used to send auxiliary tablespaces, not PGDATA.
*/
static int64
-sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
backup_manifest_info *manifest)
{
int64 size;
@@ -1154,7 +1155,7 @@ sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- const char *spcoid)
+ Oid spcoid)
{
DIR *dir;
struct dirent *de;
@@ -1419,8 +1420,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid,
- manifest, spcoid);
+ true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
+ manifest);
if (sent || sizeonly)
{
@@ -1489,8 +1490,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
*/
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid)
+ struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest)
{
int fd;
BlockNumber blkno = 0;
diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index fee30c21e1..3bdbe1f989 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -407,7 +407,7 @@ SendTablespaceList(List *tablespaces)
}
else
{
- values[0] = ObjectIdGetDatum(strtoul(ti->oid, NULL, 10));
+ values[0] = ObjectIdGetDatum(ti->oid);
values[1] = CStringGetTextDatum(ti->path);
}
if (ti->size >= 0)
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index d41b439980..5a481dbcf5 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -39,7 +39,7 @@ extern void InitializeBackupManifest(backup_manifest_info *manifest,
backup_manifest_option want_manifest,
pg_checksum_type manifest_checksum_type);
extern void AddFileToBackupManifest(backup_manifest_info *manifest,
- const char *spcoid,
+ Oid spcoid,
const char *pathname, size_t size,
pg_time_t mtime,
pg_checksum_context *checksum_ctx);
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 3e68abc2bb..1432d9c206 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -27,7 +27,7 @@
*/
typedef struct
{
- char *oid; /* tablespace's OID, as a decimal string */
+ Oid oid; /* tablespace's OID */
char *path; /* full path to tablespace's directory */
char *rpath; /* relative path if it's within PGDATA, else
* NULL */
--
2.37.1 (Apple Git-137.1)
v3-0003-Change-how-a-base-backup-decides-which-files-have.patch
From 65f7f683ecbcebc6575f73716ce86e2a2f70368b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v3 3/7] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
v3-0005-Prototype-patch-for-incremental-and-differential-.patch
From 1a93d452eee57de92d3fc281b1a0bd65f7bb9acb Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v3 5/7] Prototype patch for incremental and differential
backup.
We don't differentiate between incremental and differential backups;
the term "incremental" as used herein means "either incremental or
differential".
This adds a new background process, the WAL summarizer, whose behavior
is governed by new GUCs wal_summarize_mb and wal_summarize_keep_time.
This writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block" which is
0 if a relation is created or destroyed within a certain range of WAL
records, or otherwise the shortest length to which the relation was
truncated during that range of WAL records, or otherwise
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied in case of an incremental backup covering that
range of WAL records.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice if we could do something about incremental
JSON parsing.
XXX. This needs a lot of work on documentation and tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
src/backend/access/transam/xlog.c | 93 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 334 +++-
src/backend/backup/basebackup_incremental.c | 873 ++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1414 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 108 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 46 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 29 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1276 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 +++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 23 +
64 files changed, 8429 insertions(+), 60 deletions(-)
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 677a5bf51b..6cfeee63e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3499,6 +3500,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3778,8 +3816,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3822,6 +3860,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5146,9 +5204,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6829,6 +6887,17 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * If there hasn't been much system activity in a while, the WAL
+ * summarizer may be sleeping for relatively long periods, which could
+ * delay an incremental backup that has started concurrently. In the hopes
+ * of avoiding that, poke the WAL summarizer here.
+ *
+ * Possibly this should instead be done at some earlier point in this
+ * function, but it's not clear that it matters much.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7503,6 +7572,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5549e1afc5..89ddec5bf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
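For reference, the check above keys off the extra lines that
build_backup_content() now appends, so the backup_label inside an
incremental backup ends with something like this (values are just
illustrative):

INCREMENTAL FROM LSN: 0/2000028
INCREMENTAL FROM TLI: 1

A plain data directory's backup_label never contains these lines, which
is what lets read_backup_label() refuse to start a server directly on an
incremental backup.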
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
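To sketch the resulting replication-protocol flow (the UPLOAD_MANIFEST
command itself is added by another patch in this series, so take the
exact spelling below as illustrative rather than definitive):

UPLOAD_MANIFEST
-- client streams the prior backup's backup_manifest --
BASE_BACKUP ( INCREMENTAL )

If INCREMENTAL is requested without a preceding UPLOAD_MANIFEST, ib is
NULL and we error out above; if a manifest was uploaded but a full backup
is requested, the manifest is simply ignored.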
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
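For anyone who wants to poke at the incremental files directly, here is a
minimal sketch - not part of the patch - of a reader for the header that
sendFile() emits above: magic, block count, and truncation block length,
four bytes each, followed by the relative block numbers and then the
block images. The struct and function names are hypothetical, the caller
still needs to compare hdr->magic against INCREMENTAL_MAGIC, and error
handling is pared to the bone:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    uint32_t    magic;          /* expected to equal INCREMENTAL_MAGIC */
    uint32_t    num_blocks;     /* number of block images that follow */
    uint32_t    truncation_block_length;
    uint32_t   *blocknos;       /* num_blocks relative block numbers */
} IncrementalFileHeader;

static bool
read_incremental_header(FILE *fp, IncrementalFileHeader *hdr)
{
    if (fread(&hdr->magic, sizeof(uint32_t), 1, fp) != 1)
        return false;
    if (fread(&hdr->num_blocks, sizeof(uint32_t), 1, fp) != 1)
        return false;
    if (fread(&hdr->truncation_block_length, sizeof(uint32_t), 1, fp) != 1)
        return false;
    if (hdr->num_blocks > 0)
    {
        hdr->blocknos = malloc(sizeof(uint32_t) * hdr->num_blocks);
        if (hdr->blocknos == NULL ||
            fread(hdr->blocknos, sizeof(uint32_t),
                  hdr->num_blocks, fp) != hdr->num_blocks)
            return false;
    }
    else
        hdr->blocknos = NULL;

    /* The num_blocks block images, BLCKSZ bytes apiece, follow here. */
    return true;
}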
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..20cc00bded
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,873 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
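+
+/*
+ * A worked example of the 90% threshold above, purely illustrative: with
+ * the default 8192-byte BLCKSZ, a 100 MB segment holds 12,800 blocks. At
+ * 2,000 modified blocks the incremental file carries about 16% of the
+ * segment's data, so we send it incrementally; at 12,000 modified blocks
+ * (roughly 94%), nblocks * BLCKSZ exceeds 0.9 * size and we send the
+ * whole file.
+ */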
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
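+
+/*
+ * Example of GetIncrementalFileSize(): with 4-byte block numbers and the
+ * default BLCKSZ of 8192, an incremental file covering 10 blocks occupies
+ * 3 * 4 + (8192 + 4) * 10 = 81,972 bytes, and one covering no blocks at
+ * all is just the 12-byte header.
+ */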
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
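+
+/*
+ * Filename format example: a summary on timeline 1 covering 0/1000028
+ * through 0/2000000 is named
+ * 0000000100000000010000280000000002000000.summary - five 8-character
+ * hex fields (the TLI, then the high and low halves of each LSN), 40 hex
+ * digits in all, which is exactly what IsWalSummaryFilename() accepts.
+ */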
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != EEXIST || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
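+ * Valid names are 40 hex characters - the TLI, then the start and end LSNs,
+ * each LSN as two zero-padded %08X halves - followed by ".summary". For
+ * instance (a made-up example), TLI 1 with LSNs 0/1000028 through 0/10000D8
+ * gives "00000001000000000100002800000000010000D8.summary".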
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
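+
+ /*
+ * appendStringInfoVA returns 0 on success, or an estimate of the space
+ * needed if the buffer is too small; enlarge and retry until it fits.
+ */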
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
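+ *
+ * Assuming this is exposed at the SQL level under the same name, usage
+ * would look something like:
+ *
+ *   SELECT * FROM pg_available_wal_summaries();
+ *
+ * with one (tli, start_lsn, end_lsn) row per summary file on disk.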
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
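+ *
+ * Assuming a matching SQL-level function, usage would look something like:
+ *
+ *   SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/10000D8');
+ *
+ * which emits one row per modified block, plus one row per relation fork
+ * with a valid limit block, flagged as such by the final boolean column.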
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_UINT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 54e9bfb8c4..0538b84ef8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -251,6 +252,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -442,6 +444,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -561,6 +564,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1847,6 +1851,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2714,6 +2721,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3067,6 +3076,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3185,6 +3195,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3580,6 +3604,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3730,6 +3760,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3756,6 +3788,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3853,6 +3886,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4074,6 +4108,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5380,6 +5416,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5516,6 +5556,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..34bd254183
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1414 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and LSN
+ * at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ *
+ * switch_requested can be set to true to notify the summarizer that a new
+ * WAL summary file should be written as soon as possible, without trying
+ * to read more WAL first.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+ bool switch_requested;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+ XLogRecPtr redo_pointer;
+ bool redo_pointer_reached;
+ XLogRecPtr redo_pointer_refresh_lsn;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * one minute (300 * 200 = 60 * 1000).
+ */
+#define MAX_SLEEP_QUANTA 300
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
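+ *
+ * wal_summarize_mb is a soft cap, in megabytes, on the amount of WAL
+ * covered by a single summary file; a value of 0 disables summarization.
+ * wal_summarize_keep_time is the retention period for summary files, in
+ * minutes; a value of 0 disables automatic removal.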
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->switch_requested = false;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered by a
+ * single summary file. If we read a WAL record that ends after the
+ * cutoff LSN computed here, we'll stop the summary. In most cases, it
+ * will actually stop earlier than that, but this is here as a
+ * backstop.
+ */
+ cutoff_lsn = current_lsn + (uint64) wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->switch_requested = false;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in shared mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
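+ *
+ * A hypothetical caller that must not proceed until WAL through some
+ * target LSN has been summarized might do:
+ *
+ *   if (WaitForWalSummarization(target_lsn, 60000) < target_lsn)
+ *       ereport(ERROR,
+ *               (errmsg("timed out waiting for WAL summarization")));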
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ * If it hasn't, but the in-memory value has reached the target value,
+ * request that a file be written as soon as possible.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (summarized_lsn < lsn &&
+ WalSummarizerCtl->pending_lsn >= lsn)
+ WalSummarizerCtl->switch_requested = true;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /*
+ * Limit the sleep to 1 second, because we may need to request a
+ * switch.
+ */
+ if (remaining_timeout > 1000)
+ remaining_timeout = 1000;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = start_lsn;
+ private_data->redo_pointer_reached =
+ (start_lsn >= private_data->redo_pointer);
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool switch_requested;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /*
+ * We attempt, on a best effort basis only, to make WAL summary file
+ * boundaries line up with checkpoint cycles. So, if the last redo
+ * pointer we've seen was in the future, and this record starts at
+ * that redo pointer, stop before processing and let it be included in
+ * the next summary file.
+ *
+ * Note that in the case of a checkpoint triggered by a backup, the
+ * redo pointer is likely to be pointing to the first record on a
+ * page. Before reading the record, xlogreader->EndRecPtr will have
+ * pointed to the start of the page, which precedes the redo LSN. But
+ * after reading the next record, we'll advance over the page header
+ * and realize that the next record starts at the redo LSN exactly,
+ * making this the first point at which we can realize that it's time
+ * to stop.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->ReadRecPtr >= private_data->redo_pointer)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ default:
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /*
+ * Also update shared memory, and handle any request for a WAL summary
+ * file switch.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ switch_requested = WalSummarizerCtl->switch_requested;
+ LWLockRelease(WALSummarizerLock);
+ if (switch_requested)
+ break;
+
+ /*
+ * Periodically update our notion of the redo pointer, because it
+ * might be changing concurrently. There's no interlocking here: we
+ * might race past the new redo pointer before we learn about it.
+ * That's OK; we only use the redo pointer as a heuristic for where to
+ * stop summarizing.
+ *
+ * It would be nice if we could just fetch the updated redo pointer on
+ * every pass through this loop, but that seems a bit too expensive:
+ * GetRedoRecPtr acquires a heavily-contended spinlock. So, instead,
+ * just fetch the updated value if we've just had to sleep, or if
+ * we've read more than a segment's worth of WAL without sleeping.
+ */
+ if (private_data->waited || xlogreader->EndRecPtr >
+ private_data->redo_pointer_refresh_lsn + wal_segment_size)
+ {
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = xlogreader->EndRecPtr;
+ private_data->redo_pointer_reached =
+ (xlogreader->EndRecPtr >= private_data->redo_pointer);
+ }
+
+ /*
+ * Recheck whether we've just caught up with the redo pointer, and if
+ * so, stop. This has the same purpose as the earlier check for the
+ * same condition above, but there we've just read a record and might
+ * decide against including it in the current summary file, whereas
+ * here we've already included it and might decide against reading the
+ * next one. Note that we may have just refreshed our notion of the
+ * redo pointer, so it's smart to check here before we do any more
+ * work.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->EndRecPtr >= private_data->redo_pointer)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
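+ *
+ * For example, with sleep_quanta = 8 the sleep lasts 1600ms; if nothing at
+ * all was read since the last sleep, it doubles to 16 quanta (3200ms), and
+ * repeated idle cycles eventually hit the MAX_SLEEP_QUANTA cap of one
+ * minute.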
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files whose last modification time is older than the
+ * retention cutoff, at most once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
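+ * (wal_summarize_keep_time is in minutes, so with the default of
+ * 7 * 24 * 60, anything untouched for more than a week is eligible.)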
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the summarized range of WAL no longer exists on disk, we can
+ * remove the summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
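+ /* overall format is text (0), with no per-column format codes to follow */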
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
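+ /* reparent to a long-lived context so the data survives this command */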
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
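+ /*
+ * Hold off cancel/die interrupts while reading the message, since an
+ * abort partway through would leave us out of sync with the client.
+ */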
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in CopyIn mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d422..bd0a921a3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -292,7 +292,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -314,6 +315,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 9c5fdeb3ca..17ad986c98 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -140,6 +141,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -160,6 +162,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -241,6 +244,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 1e671c560c..037111b89f 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 16ec6c5ef0..a532f57af1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3181,6 +3184,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..4736606ac1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1a8cef345d..2e27fb58c6 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -688,6 +694,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1751,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1819,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1993,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2349,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2387,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2412,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2447,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2863,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..cb20480aae
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,46 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
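+
+ /*
+ * Bits of 'found': 1 = START WAL LOCATION, 2 = START TIMELINE,
+ * 4 = INCREMENTAL FROM LSN, 8 = INCREMENTAL FROM TLI.
+ */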
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ if (pg_checksum_init(&checksum_ctx, checksum_type) < 0)
+ pg_fatal("could not initialize checksum of file \"backup_label\"");
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
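+ /*
+ * Temporarily NUL-terminate the line at 'e' so that sscanf cannot run
+ * past the end of the data; the saved byte is restored afterward.
+ */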
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
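+ /* 'offset' is tracked only to improve write-error reporting */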
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..bea0db405e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,29 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..369447204a
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1276 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
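+/* INCREMENTAL_PREFIX_LENGTH must equal strlen(INCREMENTAL_PREFIX). */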
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ bool progress;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"output", required_argument, NULL, 'o'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"progress", no_argument, NULL, 'P'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'P':
+ opt.progress = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
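+/*
+ * Example invocation (all paths hypothetical) exercising the options parsed
+ * above: combine a full backup with two incrementals, relocating one
+ * user-defined tablespace and selecting SHA-256 manifest checksums:
+ *
+ *   pg_combinebackup -T /srv/ts_old=/srv/ts_new \
+ *       --manifest-checksums=SHA256 -o /srv/restored \
+ *       full incr1 incr2
+ */
+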
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
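+
+/*
+ * A worked example of the parsing rule above (hypothetical paths): the
+ * argument "/mnt/a\=b=/mnt/new" maps the old directory "/mnt/a=b" to the
+ * new directory "/mnt/new", because the backslash-escaped equals sign is
+ * copied literally instead of being treated as the field separator.
+ */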
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup label itself is limited (at
+ * least by some parts of the code) to MAXPGPATH, so include that value
+ * in the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
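+
+/*
+ * To restate the checks above: for a chain of backups b_0 (full) through
+ * b_n (latest), we require previous_tli == 0 for b_0 and previous_tli != 0
+ * for every later backup, and for each adjacent pair the start_tli and
+ * start_lsn of b_i must equal the previous_tli and previous_lsn recorded
+ * by b_{i+1}.
+ */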
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -P, --progress show progress information\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ /* Reject leading zeroes, which strtoul() would otherwise accept. */
+ if (s[0] == '0' && s[1] != '\0')
+ return false;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
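+
+/*
+ * For example, "1" and "16384" parse successfully, while "", "0", "007",
+ * "16384x", and out-of-range values such as "4294967296" are all rejected.
+ */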
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
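+
+/*
+ * To illustrate the path handling above with a hypothetical tablespace OID
+ * of 16384: at the top level (relative_path == NULL), the manifest prefix
+ * is "" for the main data directory and "pg_tblspc/16384/" for the
+ * tablespace; for a subdirectory "foo" beneath either one, the prefixes
+ * become "foo/" and "pg_tblspc/16384/foo/" respectively.
+ */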
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (For example, if PG_VERSION contains "14\n", this
+ * function will return 140000.)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g., 9.6 or 8.4). So if we see what
+ * looks like the beginning of such a version number, just bail out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old", filename);
+ pg_fatal("%s: could not parse version number", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ return tslist;
+}
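+
+/*
+ * For example (hypothetical OIDs and paths), if pg_tblspc contains a
+ * symbolic link "16384" pointing to /srv/ts_old and an in-place tablespace
+ * directory "16385", then given -T /srv/ts_old=/srv/ts_new the resulting
+ * list has one entry with old_dir /srv/ts_old, new_dir /srv/ts_new and
+ * in_place = false, and one entry whose old_dir and new_dir are the
+ * pg_tblspc/16385 paths within the input and output directories, with
+ * in_place = true.
+ */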
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must still be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
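+
+/*
+ * A small worked example of the algorithm above: suppose the latest
+ * incremental file has truncation_block_length 3 and contains only block 1.
+ * Block 1 is sourced from that file, leaving blocks 0 and 2 missing. If the
+ * previous backup holds an incremental file containing blocks 0 and 5,
+ * block 0 is taken from it (block 5 is beyond the truncation length and is
+ * ignored), leaving block 2. A full file found one backup further back then
+ * supplies block 2 at offset 2 * BLCKSZ, and since blocks were needed from
+ * later incremental files, no whole-file copy is attempted.
+ */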
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
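+
+/*
+ * For example, with truncation_block_length 4 and incremental blocks 2 and
+ * 9, the reconstructed file is 10 blocks long; blocks 4 through 8 will be
+ * zero-filled, since they belong to neither the truncated prefix nor the
+ * incremental file.
+ */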
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
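+
+/*
+ * Consequently, assuming 4-byte "unsigned" and BlockNumber, an incremental
+ * file with num_blocks = 2 and relative block numbers 3 and 7 has a
+ * 20-byte header (magic, block count, truncation block length, then the
+ * two block numbers), with block 3's data stored at offset 20 and block
+ * 7's data at offset 20 + BLCKSZ.
+ */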
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ int rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source, except if dry-run. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
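+
+/*
+ * A typical entry produced by this function looks something like this
+ * (values hypothetical):
+ *
+ * { "Path": "base/1/1259", "Size": 8192,
+ *   "Last-Modified": "2023-06-14 10:00:00 GMT",
+ *   "Checksum-Algorithm": "CRC32C", "Checksum": "0bf0e57a" }
+ */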
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 04567f349d..c3b9e07841 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
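+
+ /*
+ * WAL summary file names consist of 40 hexadecimal characters, which we
+ * take to be an 8-character timeline ID followed by 16 characters each
+ * for the start and end LSNs, plus a ".summary" suffix. For example
+ * (made-up values): 0000000100000000010000280000000001FFFFD8.summary
+ */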
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index ff60666f5c..ebff20b1d3 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
+
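+/*
+ * A worked example of the mapping (numbers are illustrative): block 70000
+ * falls in chunk 70000 / BLOCKS_PER_CHUNK = 1, at offset
+ * 70000 % BLOCKS_PER_CHUNK = 4464 within that chunk. An array chunk stores
+ * that offset as a single uint16; once the array form would need
+ * MAX_ENTRIES_PER_CHUNK (4096) entries, the chunk is converted to a bitmap,
+ * which occupies the same 8192 bytes but can represent all 65536 blocks.
+ */
+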
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status; /* hash entry status, used by simplehash */
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
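+/*
+ * For orientation, the serialized file layout produced by the code below
+ * (WriteBlockRefTable, or CreateBlockRefTableWriter and friends) is:
+ *
+ * uint32 magic number (BLOCKREFTABLE_MAGIC)
+ * for each relation fork, in sorted order:
+ * BlockRefTableSerializedEntry
+ * uint16 chunk_usage[nchunks] (trailing zero entries trimmed)
+ * per-chunk payload (offset array or bitmap, chunk_usage[i] uint16s each)
+ * all-zeroes BlockRefTableSerializedEntry as a sentinel
+ * pg_crc32c covering everything that precedes it
+ */
+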
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
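+ *
+ * For example (illustrative numbers): with BLOCKS_PER_CHUNK = 65536,
+ * setting limit_block = 70000 leaves chunk 0 untouched, discards offsets
+ * >= 4464 from chunk 1, and zeroes the usage count of every chunk after
+ * chunk 1.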
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
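+ *
+ * For example, with nchunks = 16 and chunkno = 37, max_chunks doubles
+ * from 16 to 32 and then to 64 before the arrays are grown.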
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer to be written to the underlying file,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index fcc0c4fe8d..6e51257b1c 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 48ca852381..fed5d790cc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -206,6 +206,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif /* BASEBACKUP_INCREMENTAL_H */
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f0b7b9cbd8..f68e6d4987 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12062,4 +12062,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
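+/*
+ * A minimal write callback might look like this (a sketch only; the
+ * ReadWalSummary/WriteWalSummary functions declared in backup/walsummary.h
+ * are real instances of this interface):
+ *
+ * static int
+ * write_to_fd(void *callback_arg, void *data, int length)
+ * {
+ * int fd = *(int *) callback_arg;
+ *
+ * if (write(fd, data, length) != length)
+ * pg_fatal("could not write block reference table: %m");
+ * return length;
+ * }
+ */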
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
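+/*
+ * A typical read loop, sketched with the callbacks declared in
+ * backup/walsummary.h (variable names here are illustrative):
+ *
+ * reader = CreateBlockRefTableReader(ReadWalSummary, &io, filename,
+ * ReportWalSummaryError, NULL);
+ * while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ * &limit_block))
+ * {
+ * while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
+ * lengthof(blocks))) > 0)
+ * (process blocks[0 .. n-1] for this relation fork)
+ * }
+ * DestroyBlockRefTableReader(reader);
+ */
+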
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
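+
+/*
+ * Typical write sequence (a sketch; variable names are illustrative):
+ *
+ * writer = CreateBlockRefTableWriter(WriteWalSummary, &io);
+ * for each relation fork, in the sort order described above:
+ * entry = CreateBlockRefTableEntry(rlocator, forknum);
+ * (call BlockRefTableEntrySetLimitBlock and/or
+ * BlockRefTableEntryMarkBlockModified as needed)
+ * BlockRefTableWriteEntry(writer, entry);
+ * BlockRefTableFreeEntry(entry);
+ * DestroyBlockRefTableWriter(writer);
+ */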
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..898adccb25 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -443,6 +444,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -455,6 +457,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif							/* WALSUMMARIZER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c72ba0944..353db33a9f 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..4f52ddbe79 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 480e6d6caa..a91437dfa7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8de90c4958..ff3cff8c28 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3991,3 +3991,26 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
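As a side note on the blkreftable.h API added above, here is a minimal
sketch of the in-memory write-side flow. This is not taken from the
patches: the callback signature is inferred from walsummary_read_callback()
in pg_walsummary.c below, and the surrounding code is invented purely for
illustration.

/*
 * Hedged sketch of using the in-memory BlockRefTable, assuming
 * io_callback_fn has the shape int (*)(void *, void *, int).
 */
#include <unistd.h>

#include "common/blkreftable.h"

static int
sketch_write_callback(void *callback_arg, void *data, int length)
{
	int			fd = *(int *) callback_arg;

	/* Real code must handle errors and short writes. */
	return write(fd, data, length);
}

static void
sketch_summarize(RelFileLocator rlocator, int fd)
{
	BlockRefTable *brtab = CreateEmptyBlockRefTable();

	/* Blocks 7 and 8 of the main fork were modified. */
	BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 7);
	BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 8);

	/* The relation was truncated to 100 blocks. */
	BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 100);

	/* Entries come out sorted, satisfying the writer's ordering rule. */
	WriteBlockRefTable(brtab, sketch_write_callback, &fd);
}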
Attachment: v3-0006-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From d34112316b67d535c3dd6f3ee4f2205a1c1988d6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:39 -0400
Subject: [PATCH v3 6/7] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs documentation and tests.
---
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
6 files changed, 347 insertions(+)
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt = {false, false};
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
--
2.37.1 (Apple Git-137.1)
Attachment: v3-0007-Add-TAP-tests-this-is-broken-doesn-t-work.patch (application/octet-stream)
From 5dd4ce5b1d214fed4d81d824d6edbbeb26c42efe Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Aug 2023 12:56:01 -0400
Subject: [PATCH v3 7/7] Add TAP tests (this is broken, doesn't work).
---
src/bin/pg_combinebackup/Makefile | 6 +
src/bin/pg_combinebackup/meson.build | 8 +-
src/bin/pg_combinebackup/t/001_basic.pl | 23 ++
.../pg_combinebackup/t/002_compare_backups.pl | 277 ++++++++++++++++++
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
5 files changed, 333 insertions(+), 2 deletions(-)
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
index cb20480aae..78ba05e624 100644
--- a/src/bin/pg_combinebackup/Makefile
+++ b/src/bin/pg_combinebackup/Makefile
@@ -44,3 +44,9 @@ uninstall:
clean distclean maintainer-clean:
rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
index bea0db405e..a6036dea74 100644
--- a/src/bin/pg_combinebackup/meson.build
+++ b/src/bin/pg_combinebackup/meson.build
@@ -25,5 +25,11 @@ bin_targets += pg_combinebackup
tests += {
'name': 'pg_combinebackup',
'sd': meson.current_source_dir(),
- 'bd': meson.current_build_dir()
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ ],
+ }
}
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..25bc5ef958
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,277 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Wait until we exit recovery, then stop the server.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr1->stop;
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until we exit recovery, then stop the server.
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->stop;
+
+#my $cmp = compare_data_directories($pitr1->basedir . '/pgdata',
+# $pitr2->basedir . '/pgdata', '');
+my $cmp = 0; # XXX: DISABLE BROKEN TEST
+is($cmp, 0, "directories are identical");
+
+done_testing();
+
+sub compare_data_directories
+{
+ my ($basedir1, $basedir2, $relpath) = @_;
+ my $result = 0;
+
+ if ($relpath eq '/pg_wal')
+ {
+ # Since recovery started at different LSNs, pg_wal contents may not
+ # be identical. Ignore that.
+ return 0;
+ }
+
+ my $dir1 = $basedir1 . $relpath;
+ my $dir2 = $basedir2 . $relpath;
+
+ opendir(DIR1, $dir1) || die "$dir1: $!";
+ my @files1 = grep { $_ ne '.' && $_ ne '..' } readdir(DIR1);
+ closedir(DIR1);
+
+ opendir(DIR2, $dir2) || die "$dir2: $!";
+ my %files2 = map { $_ => 'unmatched' }
+ grep { $_ ne '.' && $_ ne '..' } readdir(DIR2);
+ closedir(DIR2);
+
+ for my $fname (@files1)
+ {
+ if (!exists $files2{$fname})
+ {
+ warn "$dir1/$fname exists but $dir2/$fname does not";
+ ++$result;
+ next;
+ }
+
+ $files2{$fname} = 'matched';
+
+ if (-d "$dir1/$fname")
+ {
+ if (! -d "$dir2/$fname")
+ {
+ warn "$dir1/$fname is a directory but $dir2/$fname is not";
+ ++$result;
+ }
+ else
+ {
+ $result +=
+ compare_data_directories($basedir1, $basedir2,
+ "$relpath/$fname");
+ }
+ }
+ elsif (-d "$dir2/$fname")
+ {
+ warn "$dir2/$fname is a directory but $dir1/$fname is not";
+ ++$result;
+ }
+ else
+ {
+ # Both are plain files.
+ $result += compare_files($basedir1, $basedir2, "$relpath/$fname");
+ }
+ }
+
+ for my $fname (keys %files2)
+ {
+ if ($files2{$fname} eq 'unmatched')
+ {
+ warn "$dir2/$fname exists but $dir1/$fname does not";
+ ++$result;
+ }
+ }
+
+ return $result;
+}
+
+sub compare_files
+{
+ my ($basedir1, $basedir2, $relpath) = @_;
+ my $file1 = $basedir1 . $relpath;
+ my $file2 = $basedir2 . $relpath;
+
+ if ($relpath eq '/backup_manifest')
+ {
+ # We don't expect the backup manifest to be identical between two
+ # backups taken at different times, so just disregard it.
+ return 0;
+ }
+
+ if ($relpath eq '/backup_label.old')
+ {
+ # We don't expect the backup label to be identical; the start WAL
+ # location and probably also the start time are expected to be
+ # different.
+ return 0;
+ }
+
+ if ($relpath eq '/postgresql.conf')
+ {
+ # At least the port numbers are expected to be different, so
+ # disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/postmaster.opts')
+ {
+ # At least the cluster names are expected to be different, so
+ # disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/global/pg_control')
+ {
+ # At least the mock authentication nonce is expected to be different,
+ # so disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/pg_stat/pgstat.stat')
+ {
+ # Stats aren't stable enough to be compared here.
+ return 0;
+ }
+
+ if ($relpath =~ m@/pg_internal\.init$@)
+ {
+ # relcache init files are rebuilt at startup, so they don't need to
+ # match. And because we write out the contents of data structures like
+ # RelationData that include pointers, they almost certainly won't.
+ return 0;
+ }
+
+ # Check whether the lengths match.
+ my $length1 = -s $file1;
+ my $length2 = -s $file2;
+ if ($length1 != $length2)
+ {
+ warn "$file1 has length $length1, but $file2 has length $length2";
+ return 1;
+ }
+
+ # Compare contents.
+ my $contents1 = slurp_file($file1);
+ my $contents2 = slurp_file($file2);
+ if ($contents1 ne $contents2)
+ {
+ my $nchars = 1;
+ while (substr($contents1, 0, $nchars) eq substr($contents2, 0, $nchars))
+ {
+ ++$nchars;
+ }
+ warn sprintf("%s and %s are both of length %s, but differ beginning at byte %d",
+ $file1, $file2, $length1, $nchars - 1);
+ return 1;
+ }
+
+ # Files are identical.
+ return 0;
+}
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
--
2.37.1 (Apple Git-137.1)
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> all those basic tests had GOOD results. Please find attached. I'll try
> to schedule some more realistic (in terms of workload and sizes) tests
> in a couple of days + maybe have some fun with cross-backup-and-restores
> across standbys.
That's awesome! Thanks for testing! This can definitely benefit from
any amount of beating on it that people wish to do. It's a complex,
delicate area where any bugs risk data loss.
> If that is still an area open for discussion: wouldn't it be better to
> just specify an LSN, as that would allow resyncing a standby across a
> major lag where the WAL to replay would be enormous? Given a
> primary->standby setup where the standby is stuck at some LSN, right
> now the procedure would be:
> 1) calculate a backup manifest of the desynced 10TB standby (how?
> using which tool?) - even if possible, that means reading 10TB of
> data instead of just supplying a number, doesn't it?
> 2) take an incremental backup of the primary relative to that LSN
> 3) copy the incremental backup to the standby
> 4) apply it to the impaired standby
> 5) restart WAL replay
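>
> Concretely, an LSN-based option could collapse that to something like
> the following (the --incremental-lsn option is purely hypothetical,
> sketched here only to illustrate the idea; the posted patches accept
> only a manifest):
>
> # on the primary, relative to the LSN where the standby got stuck
> pg_basebackup -cfast -D /tmp/delta --incremental-lsn '0/169AD68'
> # ship /tmp/delta to the standby, combine it with the stuck datadir,
> # then restart the standby and let WAL replay continue
> pg_combinebackup $STUCK_PGDATA /tmp/delta -o $NEW_PGDATA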
Hmm. I wonder if this would even be a safe procedure. I admit that I
can't quite see a problem with it, but sometimes I'm kind of dumb.
> Also, maybe it's too early to ask, but wouldn't it be nice if we could
> have a future option in pg_combinebackup to avoid double writes when
> it is used from a restore host? (Right now we first need to
> reconstruct the original datadir from the full and incremental backups
> on the host storing the backups, and then transfer it again to the
> target host.) Something like this could work well from the restore
> host: pg_combinebackup /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3
> -O tar -o - | ssh dbserver 'tar xf - -C /path/to/restored/cluster'.
> The bad thing is that such a pipe prevents parallelism from day one,
> and I'm afraid I do not have a better easy idea on how to have both at
> the same time in the long term.
I don't think it's too early to ask for this, but I do think it's too
early for you to get it. ;-)
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Oct 3, 2023 at 2:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Here's a new patch set, also addressing Jakub's observation that
> MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.
Here's yet another new version. In this version, I reversed the order
of the first two patches, with the idea that what's now 0001 seems
fairly reasonable as an independent commit, and could thus perhaps be
committed sometime soon-ish. In the main patch, I added SGML
documentation for pg_combinebackup. I also fixed the broken TAP tests
so that they work, by basing them on pg_dump equivalence rather than
file-level equivalence. I'm sad to give up on testing the latter, but
it seems to be unrealistic. I cleaned up a few other odds and ends,
too. But, what exactly is the bigger picture for this patch in terms
of moving forward? Here's a list of things that are on my mind:
- I'd like to get the patch to mark the redo point in the WAL
committed[1] and then rework this patch set to make use of that
infrastructure. Right now, we make a best effort to end WAL summaries
at redo point boundaries, but it's racy, and sometimes we fail to do
so. In theory that just has the effect of potentially making an
incremental backup contain some extra blocks that it shouldn't really
need to contain, but I think it can actually lead to weird stalls,
because when an incremental backup is taken, we have to wait until a
WAL summary shows up that extends at least up to the start LSN of the
backup we're about to take. I believe all the logic in this area can
be made a good deal simpler and more reliable if that patch gets
committed and this one reworked accordingly.
- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week. File removal is governed by mtimes, so if you change the
mtimes of your summary files or whack your system clock around, weird
things might happen. But obviously this might be inconvenient. Some
people might not want WAL summary files to be generated at all because
they don't care about incremental backup, and other people might want
them retained for longer, and still other people might want them to be
not removed automatically or removed automatically based on some
criteria other than mtime. I don't really know what's best here. I
don't think the default policy that the patches implement is
especially terrible, but it's just something that I made up and I
don't have any real confidence that it's wonderful. One point to
consider here is that, if WAL summarization is enabled, checkpoints
can't remove WAL that isn't summarized yet. Mostly that's not a
problem, I think, because the WAL summarizer is pretty fast. But it
could increase disk consumption for some people. I don't think that we
need to worry about the summaries themselves being a problem in terms
of space consumption; at least in all the cases I've tested, they're
just not very big.
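In postgresql.conf terms, the behavior described above corresponds
roughly to this (the GUC names are the ones in the current patch set;
the values shown are illustrative only, not a statement of the exact
defaults or units):

# 0 disables WAL summarization; nonzero summarizes WAL, roughly this
# many MB per summary file (assumption based on the GUC name)
wal_summarize_mb = 256
# drop summary files whose mtime is older than this
wal_summarize_keep_time = '7d'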
- On a related note, I haven't yet tested this on a standby, which is
a thing that I definitely need to do. I don't know of a reason why it
shouldn't be possible for all of this machinery to work on a standby
just as it does on a primary, but then we need the WAL summarizer to
run there too, which could end up being a waste if nobody ever tries
to take an incremental backup. I wonder how that should be reflected
in the configuration. We could do something like what we've done for
archive_mode, where on means "only on if this is a primary" and you
have to say always if you want it to run on standbys as well ... but
I'm not sure if that's a design pattern that we really want to
replicate into more places. I'd be somewhat inclined to just make
whatever configuration parameters we need to configure this thing on
the primary also work on standbys, and you can set each server up as
you please. But I'm open to other suggestions.
- We need to settle the question of whether to send the whole backup
manifest to the server or just the LSN. In a previous attempt at
incremental backup, we decided the whole manifest was necessary,
because flat-copying files could make new data show up with old LSNs.
But that version of the patch set was trying to find modified blocks
by checking their LSNs individually, not by summarizing WAL. And since
the operations that flat-copy files are WAL-logged, the WAL summary
approach seems to eliminate that problem - maybe an LSN (and the
associated TLI) is good enough now. This also relates to Jakub's
question about whether this machinery could be used to fast-forward a
standby, which is not exactly a base backup but ... perhaps close
enough? I'm somewhat inclined to believe that we can simplify to an
LSN and TLI; however, if we do that, then we'll have big problems if
later we realize that we want the manifest for something after all. So
if anybody thinks that there's a reason to keep doing what the patch
does today -- namely, upload the whole manifest to the server --
please speak up.
- It's regrettable that we don't have incremental JSON parsing; I
think that means anyone who has a backup manifest that is bigger than
1GB can't use this feature. However, that's also a problem for the
existing backup manifest feature, and as far as I can see, we have no
complaints about it. So maybe people just don't have databases with
enough relations for that to be much of a live issue yet. I'm inclined
to treat this as a non-blocker, although Andrew Dunstan tells me he
does have a prototype for incremental JSON parsing so maybe that will
land and we can use it here.
- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?
- I'm curious what people think about the pg_walsummary tool that is
included in 0006. I think it's going to be fairly important for
debugging, but it does feel a little bit bad to add a new binary for
something pretty niche. Nevertheless, merging it into any other
utility seems relatively awkward, so I'm inclined to think both that
this should be included in whatever finally gets committed and that it
should be a separate binary. I considered whether it should go in
contrib, but we seem to have moved to a policy that heavily favors
limiting contrib to extensions and loadable modules, rather than
binaries.
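For anyone who hasn't looked at the patch, the tool just takes one or
more summary files as arguments and prints one line per limit block,
block, or block range; for example (the file name is a placeholder, and
the output lines simply follow the printf formats in pg_walsummary.c):

pg_walsummary $PGDATA/pg_wal/summaries/<some .summary file>
TS 1663, DB 5, REL 16396, FORK main: limit 0
TS 1663, DB 5, REL 16396, FORK main: blocks 0..12
TS 1663, DB 5, REL 16400, FORK fsm: block 2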
Clearly there's a good amount of stuff to sort out here, but we've
still got quite a bit of time left before feature freeze so I'd like
to have a go at it. Please let me know your thoughts, if you have any.
[1]: /messages/by-id/CA+TgmoZAM24Ub=uxP0aWuWstNYTUJQ64j976FYJeVaMJ+qD0uw@mail.gmail.com
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v4-0006-Add-new-pg_walsummary-tool.patch (application/octet-stream; same patch as v3-0006 above, reposted unchanged with the v4 series)
v4-0001-Refactor-parse_filename_for_nontemp_relation-to-p.patch (application/octet-stream)
From 7432b6e1db39557bb6c7cbe3641fdfd23fb53e75 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:44 -0400
Subject: [PATCH v4 1/6] Refactor parse_filename_for_nontemp_relation to parse
more.
Instead of returning the number of characters in the RelFileNumber,
return the RelFileNumber itself. Continue to return the fork number,
as before, and additionally return the segment number.
parse_filename_for_nontemp_relation now rejects a RelFileNumber or
segment number that begins with a leading zero. Before, we accepted
such cases as relation filenames, but if we continued to do so after
this change, the function might return the same values for two
different files (e.g. 1234.5 and 001234.5 or 1234.005) which could be
annoying for callers. Since we don't actually ever generate filenames
with leading zeroes in the names, any such files that we find must
have been created by something other than PostgreSQL, and it is
therefore reasonable to treat them as non-relation files.
Along the way, change unlogged_relation_entry to store a RelFileNumber
rather than an OID. This update should have been made in
851f4cc75cdd8c831f1baa9a7abf8c8248b65890, but it was overlooked.
It's trivial to make the update as part of this commit, perhaps more
trivial than it would have been without it, so do that.
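To illustrate the new rules (examples constructed for this description,
not taken from the regression tests):

16384     -> relnumber 16384, fork MAIN_FORKNUM, segment 0
16384_vm  -> relnumber 16384, fork VISIBILITYMAP_FORKNUM, segment 0
16384.2   -> relnumber 16384, fork MAIN_FORKNUM, segment 2
016384.2  -> rejected (leading zero in the RelFileNumber)
16384.02  -> rejected (leading zero in the segment number)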
---
src/backend/backup/basebackup.c | 15 ++--
src/backend/storage/file/reinit.c | 137 ++++++++++++++++++------------
src/include/storage/reinit.h | 5 +-
3 files changed, 93 insertions(+), 64 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 7d025bcf38..b126d9c890 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1197,9 +1197,9 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
{
int excludeIdx;
bool excludeFound;
- ForkNumber relForkNum; /* Type of fork if file is a relation */
- int relnumchars; /* Chars in filename that are the
- * relnumber */
+ RelFileNumber relNumber;
+ ForkNumber relForkNum;
+ unsigned segno;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1249,23 +1249,20 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &relForkNum, &segno))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
{
char initForkFile[MAXPGPATH];
- char relNumber[OIDCHARS + 1];
/*
* If any other type of fork, check if there is an init fork
* with the same RelFileNumber. If so, the file can be
* excluded.
*/
- memcpy(relNumber, de->d_name, relnumchars);
- relNumber[relnumchars] = '\0';
- snprintf(initForkFile, sizeof(initForkFile), "%s/%s_init",
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relNumber);
if (lstat(initForkFile, &statbuf) == 0)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..5df2517b46 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
- Oid reloid; /* hash key */
+ RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
/*
@@ -195,12 +195,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -208,10 +209,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the RelFileNumber into the hash table, if it isn't already.
*/
- ent.reloid = atooid(de->d_name);
(void) hash_search(hash, &ent, HASH_ENTER, NULL);
}
@@ -235,12 +234,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* We never remove the init fork. */
@@ -251,7 +251,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
if (hash_search(hash, &ent, HASH_FIND, NULL))
{
snprintf(rm_path, sizeof(rm_path), "%s/%s",
@@ -285,14 +284,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ RelFileNumber relNumber;
+ unsigned segno;
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -304,11 +303,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspacedirname, de->d_name);
/* Construct destination pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(dstpath, sizeof(dstpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
/* OK, we're ready to perform the actual copy. */
elog(DEBUG2, "copying %s to %s", srcpath, dstpath);
@@ -327,14 +327,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
+ RelFileNumber relNumber;
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ unsigned segno;
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -342,11 +342,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/* Construct main fork pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(mainpath, sizeof(mainpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(mainpath, sizeof(mainpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
fsync_fname(mainpath, false);
}
@@ -371,52 +372,82 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* This function returns true if the file appears to be in the correct format
* for a non-temporary relation and false otherwise.
*
- * NB: If this function returns true, the caller is entitled to assume that
- * *relnumchars has been set to a value no more than OIDCHARS, and thus
- * that a buffer of OIDCHARS+1 characters is sufficient to hold the
- * RelFileNumber portion of the filename. This is critical to protect against
- * a possible buffer overrun.
+ * If it returns true, it sets *relnumber, *fork, and *segno to the values
+ * extracted from the filename. If it returns false, these values are set to
+ * InvalidRelFileNumber, InvalidForkNumber, and 0, respectively.
*/
bool
-parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
+ ForkNumber *fork, unsigned *segno)
{
- int pos;
+ unsigned long n,
+ s;
+ ForkNumber f;
+ char *endp;
- /* Look for a non-empty string of digits (that isn't too long). */
- for (pos = 0; isdigit((unsigned char) name[pos]); ++pos)
- ;
- if (pos == 0 || pos > OIDCHARS)
+ *relnumber = InvalidRelFileNumber;
+ *fork = InvalidForkNumber;
+ *segno = 0;
+
+ /*
+ * Relation filenames should begin with a digit that is not a zero. By
+ * rejecting cases involving leading zeroes, the caller can assume that
+ * there's only one possible string of characters that could have produced
+ * any given value for *relnumber.
+ *
+ * (To be clear, we don't expect files with names like 0017.3 to exist at
+ * all -- but if 0017.3 does exist, it's a non-relation file, not part of
+ * the main fork for relfilenode 17.)
+ */
+ if (name[0] < '1' || name[0] > '9')
+ return false;
+
+ /*
+ * Parse the leading digit string. If the value is out of range, we
+ * conclude that this isn't a relation file at all.
+ */
+ errno = 0;
+ n = strtoul(name, &endp, 10);
+ if (errno || name == endp || n <= 0 || n > PG_UINT32_MAX)
return false;
- *relnumchars = pos;
+ name = endp;
/* Check for a fork name. */
- if (name[pos] != '_')
- *fork = MAIN_FORKNUM;
+ if (*name != '_')
+ f = MAIN_FORKNUM;
else
{
int forkchar;
- forkchar = forkname_chars(&name[pos + 1], fork);
+ forkchar = forkname_chars(name + 1, &f);
if (forkchar <= 0)
return false;
- pos += forkchar + 1;
+ name += forkchar + 1;
}
/* Check for a segment number. */
- if (name[pos] == '.')
+ if (*name != '.')
+ s = 0;
+ else
{
- int segchar;
+ /* Reject leading zeroes, just like we do for RelFileNumber. */
+ if (name[1] < '1' || name[1] > '9')
+ return false;
- for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
- ;
- if (segchar <= 1)
+ errno = 0;
+ s = strtoul(name + 1, &endp, 10);
+ if (errno || name + 1 == endp || s <= 0 || s > PG_UINT32_MAX)
return false;
- pos += segchar;
+ name = endp;
}
/* Now we should be at the end. */
- if (name[pos] != '\0')
+ if (*name != '\0')
return false;
+
+ /* Set out parameters and return. */
+ *relnumber = (RelFileNumber) n;
+ *fork = f;
+ *segno = (unsigned) s;
return true;
}
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..f8eb7ce234 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,8 +20,9 @@
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
- int *relnumchars,
- ForkNumber *fork);
+ RelFileNumber *relnumber,
+ ForkNumber *fork,
+ unsigned *segno);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
--
2.37.1 (Apple Git-137.1)
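Not part of the patch set, but to make the new filename rules concrete,
here's a quick standalone sketch of the same parsing logic in plain C.
It hard-codes a small fork-name list in place of forkname_chars() and
uses stock integer types in place of RelFileNumber/ForkNumber, so treat
it as an illustration of the rules, not as server code:

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Accepts NNNN, NNNN_fork, NNNN.SS, and NNNN_fork.SS, rejecting leading
 * zeroes in both numeric parts, as in the patched
 * parse_filename_for_nontemp_relation().
 */
static bool
parse_relation_filename(const char *name, uint32_t *relnumber,
                        const char **fork, uint32_t *segno)
{
    static const char *forks[] = {"fsm", "vm", "init", NULL};
    unsigned long n, s = 0;
    char *endp;
    int i;

    *fork = "main";

    /* Reject empty names and leading zeroes. */
    if (name[0] < '1' || name[0] > '9')
        return false;

    /* First char is a digit, so strtoul consumes at least one char. */
    errno = 0;
    n = strtoul(name, &endp, 10);
    if (errno || n > UINT32_MAX)
        return false;
    name = endp;

    /* Optional fork name. */
    if (*name == '_')
    {
        for (i = 0; forks[i] != NULL; i++)
        {
            size_t len = strlen(forks[i]);

            if (strncmp(name + 1, forks[i], len) == 0 &&
                (name[1 + len] == '\0' || name[1 + len] == '.'))
            {
                *fork = forks[i];
                name += 1 + len;
                break;
            }
        }
        if (forks[i] == NULL)
            return false;
    }

    /* Optional segment number, again with no leading zero. */
    if (*name == '.')
    {
        if (name[1] < '1' || name[1] > '9')
            return false;
        errno = 0;
        s = strtoul(name + 1, &endp, 10);
        if (errno || s > UINT32_MAX)
            return false;
        name = endp;
    }

    /* Anything left over means it isn't a relation file. */
    if (*name != '\0')
        return false;

    *relnumber = (uint32_t) n;
    *segno = (uint32_t) s;
    return true;
}

int
main(void)
{
    const char *tests[] = {"16384", "16384_fsm", "16384.1", "16384_vm.2",
                           "0017.3", "16384.01", "16384_junk", NULL};

    for (int i = 0; tests[i] != NULL; i++)
    {
        uint32_t rel, seg;
        const char *fork;

        if (parse_relation_filename(tests[i], &rel, &fork, &seg))
            printf("%-12s -> rel %u, fork %s, segno %u\n",
                   tests[i], (unsigned) rel, fork, (unsigned) seg);
        else
            printf("%-12s -> not a relation file\n", tests[i]);
    }
    return 0;
}

As expected, 0017.3 and 16384.01 are rejected outright rather than being
taken as alternate spellings of 17.3 and 16384.1.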
Attachment: v4-0003-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From 3a77a61d05febc358b6b4a3b545bc9b56a55d0c2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v4 3/6] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
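A note below the cut line, ignored by git am: mechanically, the segment
number used during verification now comes from
parse_filename_for_nontemp_relation() rather than from atoi() on the
filename, and pages are still checked against their absolute block
number, i.e. offset by segno * RELSEG_SIZE. A toy illustration in plain
C, with made-up inputs and RELSEG_SIZE hard-coded to its default of
131072 blocks:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RELSEG_SIZE 131072      /* default: 1GB segments of 8kB blocks */

int
main(void)
{
    /* Hypothetical inputs for block 10 of segment file "16384.2". */
    bool     noverify_checksums = false;    /* NOVERIFY_CHECKSUMS unset */
    bool     data_checksums_enabled = true; /* cluster has checksums */
    bool     is_relation_file = true;       /* filename parsed OK */
    unsigned segno = 2;                     /* parsed from the ".2" */
    unsigned blkno = 10;                    /* position within segment */

    /* The new decision: verify iff the name parsed as a relation file. */
    if (!noverify_checksums && data_checksums_enabled && is_relation_file)
        printf("verify page checksum against block %u\n",
               blkno + segno * RELSEG_SIZE);
    return 0;
}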
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
Attachment: v4-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch (application/octet-stream)
From 89ec93d59cb6a2a9b9239803dc1ad818fd9cdf93 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v4 4/6] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index cc5c54dcee..ff60666f5c 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 3b97497d1a..fcc0c4fe8d 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 2379f7be7b..672e8bcf25 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
Attachment: v4-0005-Prototype-patch-for-incremental-and-differential-.patch (application/octet-stream)
From 9dbf7217076ddfefdf8cc80a2c42ac6c8b22044f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v4 5/6] Prototype patch for incremental and differential
backup.
We don't differentiate between incremental and differential backups;
the term "incremental" as used herein means "either incremental or
differential".
This adds a new background process, the WAL summarizer, whose behavior
is governed by new GUCs wal_summarize_mb and wal_summarize_keep_time.
This writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block" which is
0 if a relation is created or destroyed within a certain range of WAL
records, or otherwise the shortest length to which the relation was
truncated during that range of WAL records, or otherwise
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied in case of an incremental backup covering that
range of WAL records.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice if we could do something about incremental
JSON parsing.
XXX. This might need more work on documentation and tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
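(Notes below the cut line, ignored by git am: to make the limit-block
rules above concrete, here's a toy model in plain C. It is not the
patch's actual data structure -- that's the block reference table in
src/common/blkreftable.c -- it just shows how the three cases described
in the commit message combine for a single relation fork over one
summarized range of WAL records.)

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* "Nothing interesting happened in this range" is InvalidBlockNumber. */
static BlockNumber limit_block = InvalidBlockNumber;

/* Relation created or destroyed in the range: no old block survives. */
static void
on_create_or_destroy(void)
{
    limit_block = 0;
}

/* Truncation: remember the shortest length seen during the range. */
static void
on_truncate(BlockNumber new_length)
{
    if (limit_block == InvalidBlockNumber || new_length < limit_block)
        limit_block = new_length;
}

int
main(void)
{
    /* Scenario: the relation is truncated three times in the range. */
    on_truncate(1000);
    on_truncate(250);
    on_truncate(400);

    /*
     * Prints 250: blocks below the limit block are unchanged unless they
     * also appear in the summary's modified-block set, while everything
     * at or above it must come from the backup being taken.
     */
    printf("limit block after truncations: %u\n", (unsigned) limit_block);

    /* Scenario: the relation is then dropped and recreated. */
    on_create_or_destroy();
    printf("limit block after create/destroy: %u\n", (unsigned) limit_block);
    return 0;
}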
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 42 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 227 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlog.c | 93 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 334 +++-
src/backend/backup/basebackup_incremental.c | 873 ++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1414 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 35 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1270 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 +++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 23 +
71 files changed, 8899 insertions(+), 67 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 344de921e4..3a569069ec 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,22 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full, incremental,
+ or differential backup of the database. When used to take a full backup, it
+ makes an exact copy of the database cluster's files. When used to take an
+ incremental or differential backup, some files that would have been part of
+ a full backup may be replaced with incremental versions of the same files,
+ containing only those blocks that have been modified since the reference
+ backup. An incremental or differential backup cannot be used directly;
+ instead, <xref linkend="app-pgcombinebackup"/> must first be used to combine
+ it with the previous backups upon which it depends.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +208,25 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an incremental or differential backup. The backup manifest
+ for the reference backup must be provided, and will be uploaded to the
+ server, which will respond by sending the requested incremental or
+ differential backup. There is no real difference between the two:
+ an incremental backup is simply a backup where the reference backup is
+ a full backup, and a differential backup is one where the reference
+ backup is an incremental or differential backup. Either way,
+ the backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must be used to reconstruct a
+ full backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
@@ -595,7 +625,7 @@ PostgreSQL documentation
</varlistentry>
<varlistentry>
- <term><option>--sync-method</option></term>
+ <term><option>--sync-method=<replaceable class="parameter">method</replaceable></option></term>
<listitem>
<para>
When set to <literal>fsync</literal>, which is the default,
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..626b1b13dd
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,227 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental or differential backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an incremental or differential backup and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental or differential backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental or differential
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method=<replaceable class="parameter">method</replaceable></option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 677a5bf51b..6cfeee63e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3499,6 +3500,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3778,8 +3816,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3822,6 +3860,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5146,9 +5204,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6829,6 +6887,17 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * If there hasn't been much system activity in a while, the WAL
+ * summarizer may be sleeping for relatively long periods, which could
+ * delay an incremental backup that has started concurrently. In the hopes
+ * of avoiding that, poke the WAL summarizer here.
+ *
+ * Possibly this should instead be done at some earlier point in this
+ * function, but it's not clear that it matters much.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7503,6 +7572,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5549e1afc5..89ddec5bf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
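
To make the incremental file layout concrete: what sendFile() emits above is
three native-endian 4-byte fields followed by the block-number array, with the
block contents immediately after. Here is a stand-alone sketch of a parser;
the struct and function are hypothetical illustrations, not part of the patch
(the real consumer will be pg_combinebackup):

    typedef struct
    {
        uint32      magic;          /* expected to be INCREMENTAL_MAGIC */
        uint32      num_blocks;     /* count of block numbers that follow */
        uint32      truncation_block_length;
    } incremental_file_header;

    /*
     * Return a pointer to the block-number array within buf, or NULL if the
     * buffer is too short. Block contents begin right after the array.
     */
    static BlockNumber *
    parse_incremental_header(char *buf, size_t len,
                             incremental_file_header *hdr)
    {
        if (len < sizeof(*hdr))
            return NULL;
        memcpy(hdr, buf, sizeof(*hdr));
        if (len < sizeof(*hdr) + hdr->num_blocks * sizeof(BlockNumber))
            return NULL;
        return (BlockNumber *) (buf + sizeof(*hdr));
    }
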
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..20cc00bded
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,873 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
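
The intended calling convention for this module, sketched out (variable names
here are invented for illustration; error handling omitted):

    IncrementalBackupInfo *ib;

    ib = CreateIncrementalBackupInfo(CurrentMemoryContext);
    /* once per chunk of manifest data received from the client */
    AppendIncrementalManifestData(ib, chunk_data, chunk_len);
    FinalizeIncrementalManifest(ib);
    /* after the backup start LSN and TLI have been established */
    PrepareForIncrementalBackup(ib, backup_state);
    /* ... sendDir()/sendFile() then call GetFileBackupMethod() per file ... */
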
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
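
For instance, with invented OIDs, segment 2 of a relation stored at
base/16384/16385 comes out as follows:

    char   *ipath;

    ipath = GetIncrementalFilePath(16384, DEFAULTTABLESPACE_OID,
                                   16385, MAIN_FORKNUM, 2);
    /* ipath is "base/16384/INCREMENTAL.16385.2"; with segno == 0 the
     * numeric suffix is omitted: "base/16384/INCREMENTAL.16385" */
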
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array with room for at least
+ * RELSEG_SIZE block numbers.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid for a shared relation, but spcoid and
+ * relfilenumber should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length) followed by the block numbers and then the block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
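
As a worked example, assuming the default 8kB BLCKSZ and 4-byte block numbers,
an incremental file with 10 modified blocks occupies 12 bytes of header, 40
bytes of block numbers, and 81920 bytes of block data; a purely illustrative
check:

    Assert(GetIncrementalFileSize(10) == 12 + 40 + 81920);   /* 81972 bytes */
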
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
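
As an illustration of the contract: summaries covering 0/1000000-0/2000000 and
0/2000000-0/3000000 prove the range 0/1000000-0/3000000 complete, but if the
second summary instead began at 0/2800000, the function would return false with
*missing_lsn set to 0/2000000, the start of the gap. A hypothetical caller:

    XLogRecPtr  missing_lsn;

    if (!WalSummariesAreComplete(wslist, start_lsn, end_lsn, &missing_lsn))
        elog(DEBUG1, "WAL summaries incomplete; gap begins at %X/%X",
             LSN_FORMAT_ARGS(missing_lsn));
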
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
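
For example, a summary on timeline 1 covering 0/1000000 through 0/2000000
carries a 40-hex-digit name, mirroring the snprintf in OpenWalSummaryFile():

    snprintf(path, MAXPGPATH,
             XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
             1, 0, 0x1000000, 0, 0x2000000);
    /* "pg_wal/summaries/0000000100000000010000000000000002000000.summary" */
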
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
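
Assuming the accompanying catalog changes expose these functions under the
same names, they can be exercised from SQL for debugging:

SELECT * FROM pg_available_wal_summaries();
SELECT * FROM pg_wal_summary_contents(1, '0/1000000', '0/2000000');
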
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 54e9bfb8c4..0538b84ef8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -251,6 +252,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -442,6 +444,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -561,6 +564,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1847,6 +1851,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2714,6 +2721,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3067,6 +3076,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3185,6 +3195,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3580,6 +3604,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3730,6 +3760,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3756,6 +3788,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3853,6 +3886,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4074,6 +4108,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5380,6 +5416,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5516,6 +5556,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..34bd254183
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1414 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
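+ *
+ * For example (see SummarizeWAL below), a summary on TLI 1 covering LSNs
+ * 0/1000028 through 0/2000000 is written to
+ * pg_wal/summaries/0000000100000000010000280000000002000000.summary.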
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and LSN
+ * at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of a WAL segment file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ *
+ * switch_requested can be set to true to notify the summarizer that a new
+ * WAL summary file should be written as soon as possible, without trying
+ * to read more WAL first.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+ bool switch_requested;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+ XLogRecPtr redo_pointer;
+ bool redo_pointer_reached;
+ XLogRecPtr redo_pointer_refresh_lsn;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * one minute (300 * 200 = 60 * 1000).
+ */
+#define MAX_SLEEP_QUANTA 300
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->switch_requested = false;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered by a
+ * single summary file. If we read a WAL record that ends after the
+ * cutoff LSN computed here, we'll stop the summary. In most cases, it
+ * will actually stop earlier than that, but this is here as a
+ * backstop.
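+ *
+ * For example, with the default wal_summarize_mb of 256, a summary
+ * starting at 0/1000000 is cut off, at the latest, by the first record
+ * that ends at or after 0/11000000.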
+ */
+ cutoff_lsn = current_lsn + (uint64) wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->switch_requested = false;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if
+ * it's just the beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ * If it hasn't, but the in-memory value has reached the target value,
+ * request that a file be written as soon as possible.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (summarized_lsn < lsn &&
+ WalSummarizerCtl->pending_lsn >= lsn)
+ WalSummarizerCtl->switch_requested = true;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /*
+ * Limit the sleep to 1 second, because we may need to request a
+ * switch.
+ */
+ if (remaining_timeout > 1000)
+ remaining_timeout = 1000;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = start_lsn;
+ private_data->redo_pointer_reached =
+ (start_lsn >= private_data->redo_pointer);
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
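+ *
+ * For example, if start_lsn is 0/2000000, a segment boundary, the first
+ * record on that segment actually begins at 0/2000028, just past the
+ * 40-byte long page header.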
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool switch_requested;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /*
+ * We attempt, on a best effort basis only, to make WAL summary file
+ * boundaries line up with checkpoint cycles. So, if the last redo
+ * pointer we've seen was in the future, and this record starts at
+ * that redo pointer, stop before processing and let it be included in
+ * the next summary file.
+ *
+ * Note that in the case of a checkpoint triggered by a backup, the
+ * redo pointer is likely to be pointing to the first record on a
+ * page. Before reading the record, xlogreader->EndRecPtr will have
+ * pointed to the start of the page, which precedes the redo LSN. But
+ * after reading the next record, we'll advance over the page header
+ * and realize that the next record starts at the redo LSN exactly,
+ * making this the first point at which we can realize that it's time
+ * to stop.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->ReadRecPtr >= private_data->redo_pointer)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ default:
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /*
+ * Also update shared memory, and handle any request for a WAL summary
+ * file switch.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ switch_requested = WalSummarizerCtl->switch_requested;
+ LWLockRelease(WALSummarizerLock);
+ if (switch_requested)
+ break;
+
+ /*
+ * Periodically update our notion of the redo pointer, because it
+ * might be changing concurrently. There's no interlocking here: we
+ * might race past the new redo pointer before we learn about it.
+ * That's OK; we only use the redo pointer as a heuristic for where to
+ * stop summarizing.
+ *
+ * It would be nice if we could just fetch the updated redo pointer on
+ * every pass through this loop, but that seems a bit too expensive:
+ * GetRedoRecPtr acquires a heavily-contended spinlock. So, instead,
+ * just fetch the updated value if we've just had to sleep, or if
+ * we've read more than a segment's worth of WAL without sleeping.
+ */
+ if (private_data->waited || xlogreader->EndRecPtr >
+ private_data->redo_pointer_refresh_lsn + wal_segment_size)
+ {
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = xlogreader->EndRecPtr;
+ private_data->redo_pointer_reached =
+ (xlogreader->EndRecPtr >= private_data->redo_pointer);
+ }
+
+ /*
+ * Recheck whether we've just caught up with the redo pointer, and if
+ * so, stop. This has the same purpose as the earlier check for the
+ * same condition above, but there we've just read a record and might
+ * decide against including it in the current summary file, whereas
+ * here we've already included it and might decide against reading the
+ * next one. Note that we may have just refreshed our notion of the
+ * redo pointer, so it's smart to check here before we do any more
+ * work.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->EndRecPtr >= private_data->redo_pointer)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
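+ *
+ * For example, a sleep at sleep_quanta = 4 followed by a burst of ten
+ * page reads drops sleep_quanta back to 1 before the next sleep.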
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files whose retention time has expired, but only
+ * if the WAL they summarize has already been removed.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
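+ *
+ * For example, with the default wal_summarize_keep_time of one week
+ * (10080 minutes), the cutoff is 604800 seconds before now.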
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the WAL doesn't exist any more, we can remove it if the file
+ * modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
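+ *
+ * We respond with a CopyInResponse and then consume the manifest as a
+ * stream of CopyData messages terminated by CopyDone. A hypothetical
+ * libpq client might drive this roughly as follows (error handling
+ * omitted):
+ *
+ * res = PQexec(conn, "UPLOAD_MANIFEST");
+ * ... check PQresultStatus(res) == PGRES_COPY_IN ...
+ * PQputCopyData(conn, manifest_data, manifest_len);
+ * PQputCopyEnd(conn, NULL);
+ * res = PQgetResult(conn);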
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore Flush/Sync for the convenience of client libraries. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d422..bd0a921a3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -292,7 +292,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -314,6 +315,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 9c5fdeb3ca..17ad986c98 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -140,6 +141,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -160,6 +162,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -241,6 +244,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 1e671c560c..037111b89f 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 16ec6c5ef0..a532f57af1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3181,6 +3184,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..4736606ac1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1a8cef345d..33416b11cf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
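+ /*
+ * Bits in "found" track which lines have been seen: 1 = START WAL
+ * LOCATION, 2 = START TIMELINE, 4 = INCREMENTAL FROM LSN, and
+ * 8 = INCREMENTAL FROM TLI.
+ */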
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, a pointer to the byte
+ * following the match is stored into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
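+ /*
+ * Temporarily NUL-terminate the line at e so that sscanf() cannot read
+ * past the end of it; the saved byte is restored just below.
+ */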
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
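
To make the label parsing above easier to check: LSNs here use the usual
%X/%X convention. A minimal standalone sketch of what parse_lsn computes,
not part of the patch and ignoring the temporary NUL-termination of the
line, would be:

    #include <stdint.h>
    #include <stdio.h>

    /* Returns the number of characters consumed, or 0 on failure. */
    static int
    sketch_parse_lsn(const char *s, uint64_t *lsn)
    {
        unsigned    hi;
        unsigned    lo;
        int         nchars;

        if (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) != 2)
            return 0;
        *lsn = ((uint64_t) hi) << 32 | lo;
        return nchars;
    }

    int
    main(void)
    {
        uint64_t    lsn;

        if (sketch_parse_lsn("0/2000028", &lsn) > 0)
            printf("%llX\n", (unsigned long long) lsn);
        return 0;
    }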
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
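
For context, callers are expected to drive copy_file() roughly as follows;
the paths are made up, and this just mirrors the intended pattern of copying
a file while collecting a checksum for the new manifest:

    pg_checksum_context ctx;
    uint8       payload[PG_CHECKSUM_MAX_LENGTH];
    int         len;

    if (pg_checksum_init(&ctx, CHECKSUM_TYPE_CRC32C) < 0)
        pg_fatal("could not initialize checksum context");

    /* copies the file, feeding every block through ctx */
    copy_file("x/base/1/1259", "z/base/1/1259", &ctx, false);

    /* checksum of the copied file, e.g. for add_file_to_manifest() */
    len = pg_checksum_final(&ctx, payload);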
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex characters, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
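
To illustrate how the parsed data gets consumed, a lookup and a scan look
about like this; manifest_files_lookup and friends are the functions that
simplehash generates from SH_PREFIX above, the path is made up, and remember
that load_backup_manifest() returns NULL if the manifest is missing:

    manifest_data *mdata = load_backup_manifest("/path/to/backup");
    manifest_file *mfile;
    manifest_files_iterator it;
    manifest_wal_range *range;

    /* direct lookup of one file by pathname */
    mfile = manifest_files_lookup(mdata->files, "global/pg_control");

    /* scan every file mentioned in the manifest */
    manifest_files_start_iterate(mdata->files, &it);
    while ((mfile = manifest_files_iterate(mdata->files, &it)) != NULL)
    {
        /* examine mfile->size, mfile->checksum_type, and so on */
    }

    /* walk the WAL ranges in manifest order */
    for (range = mdata->first_wal_range; range != NULL; range = range->next)
    {
        /* use range->tli, range->start_lsn, range->end_lsn */
    }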
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..a6036dea74
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,35 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..32d2846433
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1270 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version * 10000, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
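+ *
+ * For example, -T /old\=dir=/new maps the tablespace at "/old=dir" to
+ * "/new".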
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
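+ /*
+ * As we walk the chain backward, check_tli and check_lsn hold the start
+ * timeline and start LSN that the next (older) backup is expected to
+ * report, taken from the INCREMENTAL FROM fields of the backup examined
+ * just before it.
+ */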
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
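+ /*
+ * pg_check_dir() reports 0 if the directory does not exist, 1 if it
+ * exists and is empty, 2 through 4 if it exists and is non-empty in
+ * various ways, and -1 if the attempt to examine it failed.
+ */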
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", ifulldir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (For example, if PG_VERSION contains "14\n", this
+ * function will return 140000.)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g., 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
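+ * They are updated as blocks are read during reconstruction, and are used
+ * afterwards by debug_reconstruction() for logging and for the dry-run
+ * sanity check on source file lengths.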
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
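+
+ /*
+ * For example, if block 7 of the output file is to be taken from the
+ * third block stored in some incremental file, then sourcemap[7] will
+ * point to that file's rfile and offsetmap[7] will be that file's
+ * header_length plus 2 * BLCKSZ.
+ */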
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
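+ *
+ * For example, if truncation_block_length is 10 and the incremental file
+ * contains block 12, the reconstructed file will be 13 blocks long, and any
+ * blocks not supplied by some source will be zero-filled when the file is
+ * written out.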
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
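+
+ /*
+ * To illustrate the layout implied by the reads above (assuming 4-byte
+ * unsigned and BlockNumber values, as on typical platforms): an
+ * incremental file storing blocks 3 and 17 with a truncation_block_length
+ * of 42 begins with the magic number, then 2, then 42, then the block
+ * numbers 3 and 17, giving a 20-byte header; the stored block images
+ * follow immediately afterward.
+ */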
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ int rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source file. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..3d9238f366
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+ # we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
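+
+/*
+ * In outline, the manifest produced by this writer has the form (values
+ * elided; see the functions below for exactly what is emitted):
+ *
+ * { "PostgreSQL-Backup-Manifest-Version": 1,
+ *   "Files": [ { "Path": ..., "Size": ..., "Last-Modified": ...,
+ *     "Checksum-Algorithm": ..., "Checksum": ... }, ... ],
+ *   "WAL-Ranges": [ { "Timeline": ..., "Start-LSN": ..., "End-LSN": ... }, ... ],
+ *   "Manifest-Checksum": ... }
+ */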
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 04567f349d..c3b9e07841 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
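+
+/*
+ * WAL summary file names consist of 40 hexadecimal characters (a timeline
+ * ID and a pair of LSNs) followed by a ".summary" suffix, which is what
+ * the checks below look for.
+ */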
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index ff60666f5c..ebff20b1d3 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We may need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
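+
+/*
+ * Worked example: block number 200000 falls in chunk 200000 / 65536 = 3,
+ * at chunk offset 200000 % 65536 = 3392. In array form, the chunk simply
+ * stores the uint16 value 3392; in bitmap form, bit 3392 % 16 = 0 of
+ * uint16 word 3392 / 16 = 212 is set.
+ */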
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
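+
+/*
+ * For reference, the resulting on-disk layout is: a 4-byte magic number;
+ * for each entry, in sorted order, a BlockRefTableSerializedEntry followed
+ * by the truncated chunk_usage array (nchunks uint16s) and then the data
+ * for each non-empty chunk (chunk_usage[j] uint16s each); an all-zeroes
+ * BlockRefTableSerializedEntry as a sentinel; and a CRC-32C of everything
+ * preceding it.
+ */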
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
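
/*
 * [Editor's sketch, not part of the patch] Minimal example of driving the
 * reader API above. It assumes a caller-supplied io_callback_fn named
 * read_file_cb and a report_error_fn named report_fatal_error, both
 * hypothetical. Per the contract of BlockRefTableReaderNextRelation, all
 * block numbers are drained before advancing to the next relation fork.
 */
static void
dump_blkreftable(io_callback_fn read_file_cb, void *cb_arg,
                 report_error_fn report_fatal_error)
{
    BlockRefTableReader *reader;
    RelFileLocator rlocator;
    ForkNumber forknum;
    BlockNumber limit_block;
    BlockNumber blocks[256];

    reader = CreateBlockRefTableReader(read_file_cb, cb_arg,
                                       "example.summary",
                                       report_fatal_error, NULL);
    while (BlockRefTableReaderNextRelation(reader, &rlocator,
                                           &forknum, &limit_block))
    {
        unsigned    n;

        /* Drain all modified-block numbers for this relation fork. */
        while ((n = BlockRefTableReaderGetBlocks(reader, blocks, 256)) > 0)
        {
            /* ... process blocks[0 .. n-1] here ... */
        }
    }
    DestroyBlockRefTableReader(reader);
}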
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer to be written to the underlying file,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
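
/*
 * [Editor's sketch, not part of the patch] End-to-end use of the in-memory
 * API: build a table, record some modified blocks and a truncation, then
 * serialize it through a caller-supplied write callback (write_file_cb is
 * hypothetical).
 */
static void
summarize_example(const RelFileLocator *rlocator,
                  io_callback_fn write_file_cb, void *cb_arg)
{
    BlockRefTable *brtab = CreateEmptyBlockRefTable();

    /* Blocks 10 and 11 of the main fork were modified... */
    BlockRefTableMarkBlockModified(brtab, rlocator, MAIN_FORKNUM, 10);
    BlockRefTableMarkBlockModified(brtab, rlocator, MAIN_FORKNUM, 11);

    /* ...but a later truncation to 8 blocks makes us forget both. */
    BlockRefTableSetLimitBlock(brtab, rlocator, MAIN_FORKNUM, 8);

    WriteBlockRefTable(brtab, write_file_cb, cb_arg);
}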
diff --git a/src/common/meson.build b/src/common/meson.build
index fcc0c4fe8d..6e51257b1c 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 48ca852381..fed5d790cc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -206,6 +206,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f0b7b9cbd8..f68e6d4987 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12062,4 +12062,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
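
/*
 * [Editor's sketch, not part of the patch] The incremental writer path:
 * standalone entries are created, populated, written in sorted order, and
 * freed, so the whole table never needs to be held in memory at once
 * (write_file_cb is hypothetical).
 */
static void
stream_one_entry(RelFileLocator rlocator, io_callback_fn write_file_cb,
                 void *cb_arg)
{
    BlockRefTableWriter *writer;
    BlockRefTableEntry *entry;

    writer = CreateBlockRefTableWriter(write_file_cb, cb_arg);

    entry = CreateBlockRefTableEntry(rlocator, MAIN_FORKNUM);
    BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 42);
    BlockRefTableWriteEntry(writer, entry);
    BlockRefTableFreeEntry(entry);

    DestroyBlockRefTableWriter(writer);
}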
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..898adccb25 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -443,6 +444,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -455,6 +457,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c72ba0944..353db33a9f 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..4f52ddbe79 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 480e6d6caa..a91437dfa7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8de90c4958..ff3cff8c28 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3991,3 +3991,26 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v4-0002-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch (application/octet-stream)
From 94280254510de2f50cee7a597158df7bf1fca16c Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:16 -0400
Subject: [PATCH v4 2/6] Change struct tablespaceinfo's oid member from 'char
*' to 'Oid'
This shouldn't change behavior except in the unusual case where
there are files in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, because the value is actually zero, or
because the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
---
src/backend/access/transam/xlog.c | 19 ++++++++++--
src/backend/access/transam/xlogrecovery.c | 12 ++++++--
src/backend/backup/backup_manifest.c | 6 ++--
src/backend/backup/basebackup.c | 35 ++++++++++++-----------
src/backend/backup/basebackup_copy.c | 2 +-
src/include/backup/backup_manifest.h | 2 +-
src/include/backup/basebackup.h | 2 +-
7 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..677a5bf51b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8483,9 +8483,22 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
char *relpath = NULL;
char *s;
PGFileType de_type;
+ char *badp;
+ Oid tsoid;
- /* Skip anything that doesn't look like a tablespace */
- if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+ /*
+ * Try to parse the directory name as an unsigned integer.
+ *
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
+ * garbage. If we come across a name that doesn't meet those
+ * criteria, skip it.
+ */
+ if (de->d_name[0] < '1' || de->d_name[0] > '9')
+ continue;
+ errno = 0;
+ tsoid = strtoul(de->d_name, &badp, 10);
+ if (*badp != '\0' || errno == EINVAL || errno == ERANGE)
continue;
snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
@@ -8560,7 +8573,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
}
ti = palloc(sizeof(tablespaceinfo));
- ti->oid = pstrdup(de->d_name);
+ ti->oid = tsoid;
ti->path = pstrdup(linkpath);
ti->rpath = relpath;
ti->size = -1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..5549e1afc5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -678,7 +678,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
tablespaceinfo *ti = lfirst(lc);
char *linkloc;
- linkloc = psprintf("pg_tblspc/%s", ti->oid);
+ linkloc = psprintf("pg_tblspc/%u", ti->oid);
/*
* Remove the existing symlink if any and Create the symlink
@@ -692,7 +692,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not create symbolic link \"%s\": %m",
linkloc)));
- pfree(ti->oid);
pfree(ti->path);
pfree(ti);
}
@@ -1341,6 +1340,8 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
+ char *endp;
+
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1360,7 +1361,12 @@ read_tablespace_map(List **tablespaces)
str[n++] = '\0';
ti = palloc0(sizeof(tablespaceinfo));
- ti->oid = pstrdup(str);
+ errno = 0;
+ ti->oid = strtoul(str, &endp, 10);
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
ti->path = pstrdup(str + n);
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index cee6216524..aeed362a9a 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -97,7 +97,7 @@ FreeBackupManifest(backup_manifest_info *manifest)
* Add an entry to the backup manifest for a file.
*/
void
-AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
+AddFileToBackupManifest(backup_manifest_info *manifest, Oid spcoid,
const char *pathname, size_t size, pg_time_t mtime,
pg_checksum_context *checksum_ctx)
{
@@ -114,9 +114,9 @@ AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
* pathname relative to the data directory (ignoring the intermediate
* symlink traversal).
*/
- if (spcoid != NULL)
+ if (OidIsValid(spcoid))
{
- snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%s/%s", spcoid,
+ snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%u/%s", spcoid,
pathname);
pathname = pathbuf;
}
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b126d9c890..b537f46219 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -75,14 +75,15 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
-static int64 sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
struct backup_manifest_info *manifest);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, const char *spcoid);
+ backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid);
+ struct stat *statbuf, bool missing_ok,
+ Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
@@ -305,7 +306,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, NULL);
+ true, NULL, InvalidOid);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
NULL);
@@ -346,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, NULL);
+ sendtblspclinks, &manifest, InvalidOid);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -355,11 +356,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ false, InvalidOid, InvalidOid, &manifest);
}
else
{
- char *archive_name = psprintf("%s.tar", ti->oid);
+ char *archive_name = psprintf("%u.tar", ti->oid);
bbsink_begin_archive(sink, archive_name);
@@ -623,8 +624,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(sink, pathbuf, pathbuf, &statbuf, false, InvalidOid,
- &manifest, NULL);
+ sendFile(sink, pathbuf, pathbuf, &statbuf, false,
+ InvalidOid, InvalidOid, &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1087,7 +1088,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
_tarWritePadding(sink, len);
- AddFileToBackupManifest(manifest, NULL, filename, len,
+ AddFileToBackupManifest(manifest, InvalidOid, filename, len,
(pg_time_t) statbuf.st_mtime, &checksum_ctx);
}
@@ -1099,7 +1100,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
* Only used to send auxiliary tablespaces, not PGDATA.
*/
static int64
-sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
backup_manifest_info *manifest)
{
int64 size;
@@ -1154,7 +1155,7 @@ sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- const char *spcoid)
+ Oid spcoid)
{
DIR *dir;
struct dirent *de;
@@ -1416,8 +1417,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid,
- manifest, spcoid);
+ true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
+ manifest);
if (sent || sizeonly)
{
@@ -1486,8 +1487,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
*/
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid)
+ struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest)
{
int fd;
BlockNumber blkno = 0;
diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index fee30c21e1..3bdbe1f989 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -407,7 +407,7 @@ SendTablespaceList(List *tablespaces)
}
else
{
- values[0] = ObjectIdGetDatum(strtoul(ti->oid, NULL, 10));
+ values[0] = ObjectIdGetDatum(ti->oid);
values[1] = CStringGetTextDatum(ti->path);
}
if (ti->size >= 0)
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index d41b439980..5a481dbcf5 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -39,7 +39,7 @@ extern void InitializeBackupManifest(backup_manifest_info *manifest,
backup_manifest_option want_manifest,
pg_checksum_type manifest_checksum_type);
extern void AddFileToBackupManifest(backup_manifest_info *manifest,
- const char *spcoid,
+ Oid spcoid,
const char *pathname, size_t size,
pg_time_t mtime,
pg_checksum_context *checksum_ctx);
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 3e68abc2bb..1432d9c206 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -27,7 +27,7 @@
*/
typedef struct
{
- char *oid; /* tablespace's OID, as a decimal string */
+ Oid oid; /* tablespace's OID */
char *path; /* full path to tablespace's directory */
char *rpath; /* relative path if it's within PGDATA, else
* NULL */
--
2.37.1 (Apple Git-137.1)
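Restating the validation rule that 0002 introduces, outside the diff
context: a directory name is accepted only if it is a decimal integer
in [1, 2^32-1] with no leading zeroes and no trailing garbage. Here is
a minimal stand-alone sketch (illustrative only; the patch performs
the equivalent checks inline in do_pg_backup_start() and
read_tablespace_map()):

#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned int Oid;

static bool
parse_tablespace_oid(const char *name, Oid *oid)
{
	unsigned long val;
	char	   *endp;

	/* First character must be 1-9: rejects "", "0", and leading zeroes. */
	if (name[0] < '1' || name[0] > '9')
		return false;
	errno = 0;
	val = strtoul(name, &endp, 10);
	/* Reject trailing garbage and values that don't fit in 32 bits. */
	if (*endp != '\0' || errno != 0 || val > 0xFFFFFFFFUL)
		return false;
	*oid = (Oid) val;
	return true;
}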
On Wed, Oct 4, 2023 at 4:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Clearly there's a good amount of stuff to sort out here, but we've
> still got quite a bit of time left before feature freeze so I'd like
> to have a go at it. Please let me know your thoughts, if you have any.
Apparently, nobody has any thoughts, but here's an updated patch set
anyway. The main change, other than rebasing, is that I did a bunch
more documentation work on the main patch (0005). I'm much happier
with it now, although I expect it may need more adjustments here and
there as outstanding design questions get settled.
After some thought, I think that it should be fine to commit 0001 and
0002 as independent refactoring patches, and I plan to go ahead and do
that pretty soon unless somebody objects.
Thanks,
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v5-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch (application/octet-stream)
From b9355ab0f9c9f10736a13e7730294b82fc374963 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v5 4/6] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index 70884be00c..3c8effc533 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index ae05ac63cf..aa646f96a3 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index f0acd9f1e7..9895f2f17d 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
Attachment: v5-0006-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From 5a3e4b4d41faa184f03cddf45f546de764eac6de Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:39 -0400
Subject: [PATCH v5 6/6] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs documentation and tests.
---
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
6 files changed, 347 insertions(+)
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt = {0}; /* ensure option flags start out false */
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
--
2.37.1 (Apple Git-137.1)
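As a usage sketch for the new tool (the summary file name below is
illustrative; real files live under $PGDATA/pg_wal/summaries and are
named $TLI${START_LSN}${END_LSN}.summary, all in hex):

pg_walsummary pg_wal/summaries/000000010000000028000D800000000028A0E1F8.summary

Given the printf formats above, output lines look like:

TS 1663, DB 5, REL 16384, FORK main: blocks 0..16
TS 1663, DB 5, REL 16384, FORK fsm: block 2

With --individual, each block is listed on its own line rather than as
a range; with --quiet, the files are parsed but nothing is printed.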
Attachment: v5-0003-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From ea3ed36f4767cc9d5bb3edf992f6cf291de59bbc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v5 3/6] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
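To make the new rule concrete: a file is checksum-verified only if its
name parses as a non-temporary relation filename, i.e. a relfilenumber
optionally followed by a fork and/or segment suffix, such as 16384,
16384.1, 16384_fsm, 16384_vm, or 16384_init. Anything else in those
directories - stray files, or entries like PG_VERSION and
pg_internal.init - is simply never checksum-verified, so the old
exclusion list becomes unnecessary.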
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
Attachment: v5-0002-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch (application/octet-stream)
From 8ecb8c9b5fd628b1e1df876518cc35973ce53518 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:16 -0400
Subject: [PATCH v5 2/6] Change struct tablespaceinfo's oid member from 'char
*' to 'Oid'
This shouldn't change behavior except in the unusual case where
there are files in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, because the value is actually zero, or
because the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
---
src/backend/access/transam/xlog.c | 19 ++++++++++--
src/backend/access/transam/xlogrecovery.c | 12 ++++++--
src/backend/backup/backup_manifest.c | 6 ++--
src/backend/backup/basebackup.c | 35 ++++++++++++-----------
src/backend/backup/basebackup_copy.c | 2 +-
src/include/backup/backup_manifest.h | 2 +-
src/include/backup/basebackup.h | 2 +-
7 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c0e4ca5089..6c724745b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8502,9 +8502,22 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
char *relpath = NULL;
char *s;
PGFileType de_type;
+ char *badp;
+ Oid tsoid;
- /* Skip anything that doesn't look like a tablespace */
- if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+ /*
+ * Try to parse the directory name as an unsigned integer.
+ *
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
+ * garbage. If we come across a name that doesn't meet those
+ * criteria, skip it.
+ */
+ if (de->d_name[0] < '1' || de->d_name[0] > '9')
+ continue;
+ errno = 0;
+ tsoid = strtoul(de->d_name, &badp, 10);
+ if (*badp != '\0' || errno == EINVAL || errno == ERANGE)
continue;
snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
@@ -8579,7 +8592,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
}
ti = palloc(sizeof(tablespaceinfo));
- ti->oid = pstrdup(de->d_name);
+ ti->oid = tsoid;
ti->path = pstrdup(linkpath);
ti->rpath = relpath;
ti->size = -1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..5549e1afc5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -678,7 +678,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
tablespaceinfo *ti = lfirst(lc);
char *linkloc;
- linkloc = psprintf("pg_tblspc/%s", ti->oid);
+ linkloc = psprintf("pg_tblspc/%u", ti->oid);
/*
* Remove the existing symlink if any and Create the symlink
@@ -692,7 +692,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not create symbolic link \"%s\": %m",
linkloc)));
- pfree(ti->oid);
pfree(ti->path);
pfree(ti);
}
@@ -1341,6 +1340,8 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
+ char *endp;
+
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1360,7 +1361,12 @@ read_tablespace_map(List **tablespaces)
str[n++] = '\0';
ti = palloc0(sizeof(tablespaceinfo));
- ti->oid = pstrdup(str);
+ errno = 0;
+ ti->oid = strtoul(str, &endp, 10);
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
ti->path = pstrdup(str + n);
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index cee6216524..aeed362a9a 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -97,7 +97,7 @@ FreeBackupManifest(backup_manifest_info *manifest)
* Add an entry to the backup manifest for a file.
*/
void
-AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
+AddFileToBackupManifest(backup_manifest_info *manifest, Oid spcoid,
const char *pathname, size_t size, pg_time_t mtime,
pg_checksum_context *checksum_ctx)
{
@@ -114,9 +114,9 @@ AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
* pathname relative to the data directory (ignoring the intermediate
* symlink traversal).
*/
- if (spcoid != NULL)
+ if (OidIsValid(spcoid))
{
- snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%s/%s", spcoid,
+ snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%u/%s", spcoid,
pathname);
pathname = pathbuf;
}
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b126d9c890..b537f46219 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -75,14 +75,15 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
-static int64 sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
struct backup_manifest_info *manifest);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, const char *spcoid);
+ backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid);
+ struct stat *statbuf, bool missing_ok,
+ Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
@@ -305,7 +306,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, NULL);
+ true, NULL, InvalidOid);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
NULL);
@@ -346,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, NULL);
+ sendtblspclinks, &manifest, InvalidOid);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -355,11 +356,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ false, InvalidOid, InvalidOid, &manifest);
}
else
{
- char *archive_name = psprintf("%s.tar", ti->oid);
+ char *archive_name = psprintf("%u.tar", ti->oid);
bbsink_begin_archive(sink, archive_name);
@@ -623,8 +624,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(sink, pathbuf, pathbuf, &statbuf, false, InvalidOid,
- &manifest, NULL);
+ sendFile(sink, pathbuf, pathbuf, &statbuf, false,
+ InvalidOid, InvalidOid, &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1087,7 +1088,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
_tarWritePadding(sink, len);
- AddFileToBackupManifest(manifest, NULL, filename, len,
+ AddFileToBackupManifest(manifest, InvalidOid, filename, len,
(pg_time_t) statbuf.st_mtime, &checksum_ctx);
}
@@ -1099,7 +1100,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
* Only used to send auxiliary tablespaces, not PGDATA.
*/
static int64
-sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
backup_manifest_info *manifest)
{
int64 size;
@@ -1154,7 +1155,7 @@ sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- const char *spcoid)
+ Oid spcoid)
{
DIR *dir;
struct dirent *de;
@@ -1416,8 +1417,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid,
- manifest, spcoid);
+ true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
+ manifest);
if (sent || sizeonly)
{
@@ -1486,8 +1487,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
*/
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid)
+ struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest)
{
int fd;
BlockNumber blkno = 0;
diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index fee30c21e1..3bdbe1f989 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -407,7 +407,7 @@ SendTablespaceList(List *tablespaces)
}
else
{
- values[0] = ObjectIdGetDatum(strtoul(ti->oid, NULL, 10));
+ values[0] = ObjectIdGetDatum(ti->oid);
values[1] = CStringGetTextDatum(ti->path);
}
if (ti->size >= 0)
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index d41b439980..5a481dbcf5 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -39,7 +39,7 @@ extern void InitializeBackupManifest(backup_manifest_info *manifest,
backup_manifest_option want_manifest,
pg_checksum_type manifest_checksum_type);
extern void AddFileToBackupManifest(backup_manifest_info *manifest,
- const char *spcoid,
+ Oid spcoid,
const char *pathname, size_t size,
pg_time_t mtime,
pg_checksum_context *checksum_ctx);
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 3e68abc2bb..1432d9c206 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -27,7 +27,7 @@
*/
typedef struct
{
- char *oid; /* tablespace's OID, as a decimal string */
+ Oid oid; /* tablespace's OID */
char *path; /* full path to tablespace's directory */
char *rpath; /* relative path if it's within PGDATA, else
* NULL */
--
2.37.1 (Apple Git-137.1)
Attachment: v5-0005-Prototype-patch-for-incremental-backup.patch (application/octet-stream)
From 7b150a8eb4ec49792535768ac1b5bafe42840c35 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v5 5/6] Prototype patch for incremental backup.
We don't differentiate between incremental and differential backups;
an incremental backup can be based either on a full backup or on a
previous incremental backup.
This adds a new background process, the WAL summarizer, whose behavior
is governed by new GUCs wal_summarize_mb and wal_summarize_keep_time.
It writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block", which is
0 if the relation was created or destroyed within that range of WAL
records; otherwise, the shortest length to which the relation was
truncated during that range of WAL records; otherwise,
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied for an incremental backup covering that
range of WAL records.
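To make the limit-block rule concrete, here is a minimal sketch of the
folding logic (names and signature are illustrative, not the patch's
actual API):

#include <stdbool.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Fold one WAL-derived event for a relation fork into its limit block
 * for the summarized range: create/drop forces 0, truncation keeps the
 * shortest length seen, and InvalidBlockNumber means neither happened.
 */
static BlockNumber
fold_limit_block(BlockNumber limit, bool created_or_destroyed,
				 bool truncated, BlockNumber truncated_to)
{
	if (created_or_destroyed)
		return 0;
	if (truncated && (limit == InvalidBlockNumber || truncated_to < limit))
		return truncated_to;
	return limit;
}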
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
Open issues:
- Needs some rework once XLOG_CHECKPOINT_REDO patch is committed.
- How should we control generation and retention of summary files?
What should be the defaults?
- Needs to be tested on a standby.
- Should we send the whole backup manifest to the server or, say,
just an LSN?
- Should the timeout when waiting for WAL summaries be configurable?
- It would be nice (but not essential) to do something about incremental
JSON parsing.
- Might need more tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlog.c | 93 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 334 +++-
src/backend/backup/basebackup_incremental.c | 873 ++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1414 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 35 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1270 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 +++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 23 +
72 files changed, 8981 insertions(+), 70 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest of an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files, which contain
+ only the blocks that have changed since the earlier backup, plus enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
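+
+ <!-- Illustrative only (name made up): a WAL summary file on disk looks
+ like pg_wal/summaries/0000000100000000010000280000000001012088.summary,
+ i.e. an 8-hex-digit TLI followed by 16-hex-digit start and end LSNs. -->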
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and just take full backups, which are simpler
+ to manage. For a large database that is heavily modified throughout,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
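As a concrete sketch of the incremental restore steps just documented (all
paths hypothetical), suppose the full backup has been unpacked into
/restore/full and a later incremental into /restore/incr, neither of which
is the target data directory:

pg_combinebackup /restore/full /restore/incr -o /var/lib/postgresql/data
touch /var/lib/postgresql/data/recovery.signal

The remaining steps of the usual procedure (clearing pg_wal, arranging WAL
retrieval, checking permissions and tablespace links) are unchanged.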
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 712568a62d..50536d0521 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
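+
+ <!-- For example (paths hypothetical), the short form
+ pg_basebackup -D incr -c fast -i /backups/full/backup_manifest
+ is equivalent to spelling the same option out in its long form. -->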
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during reconstruction.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. The search for files will follow
+ symbolic links for the WAL directory and each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. In that case,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
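To make the last point of the Description concrete (directory names
hypothetical), a synthetic full backup, once written, can stand in for the
chain that produced it:

pg_combinebackup full incr1 -o synthetic
pg_combinebackup synthetic incr2 -o datadir

Here incr2 is a later incremental backup whose chain of prior backups runs
through full and incr1.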
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c724745b5..d6f3cddfa0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3514,6 +3515,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3793,8 +3831,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3837,6 +3875,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5161,9 +5219,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6848,6 +6906,17 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * If there hasn't been much system activity in a while, the WAL
+ * summarizer may be sleeping for relatively long periods, which could
+ * delay an incremental backup that has started concurrently. In the hopes
+ * of avoiding that, poke the WAL summarizer here.
+ *
+ * Possibly this should instead be done at some earlier point in this
+ * function, but it's not clear that it matters much.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7522,6 +7591,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
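For illustration, the backup_label of an incremental backup now gains two
additional lines in the format produced above (values made up):

INCREMENTAL FROM LSN: 0/2000028
INCREMENTAL FROM TLI: 1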
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5549e1afc5..89ddec5bf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
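The incremental-file layout emitted by sendFile() and push_to_sink() above is
simple enough to parse outside the server. Below is a minimal standalone
sketch of a header reader; it is not part of the patch (the real consumer is
pg_combinebackup's reconstruct.c), and checking the magic number against
INCREMENTAL_MAGIC is left as a comment since its value isn't shown here:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read the header of an incremental relation file: a 4-byte magic number,
 * the count of included blocks, the truncation block length, and one
 * 4-byte block number per included block. BLCKSZ bytes of data for each
 * listed block follow the header.
 */
int
read_incremental_header(FILE *f, uint32_t *num_blocks,
                        uint32_t *truncation_block_length,
                        uint32_t **block_numbers)
{
    uint32_t    magic;

    if (fread(&magic, sizeof(magic), 1, f) != 1 ||
        fread(num_blocks, sizeof(*num_blocks), 1, f) != 1 ||
        fread(truncation_block_length,
              sizeof(*truncation_block_length), 1, f) != 1)
        return -1;
    /* A real reader would verify magic against INCREMENTAL_MAGIC here. */

    if (*num_blocks == 0)
    {
        *block_numbers = NULL;
        return 0;
    }
    *block_numbers = malloc(sizeof(uint32_t) * *num_blocks);
    if (*block_numbers == NULL)
        return -1;
    if (fread(*block_numbers, sizeof(uint32_t), *num_blocks, f) != *num_blocks)
    {
        free(*block_numbers);
        return -1;
    }
    return 0;
}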
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..20cc00bded
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,873 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
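+
+/*
+ * Illustrative arithmetic for the comment above: at roughly 1 bit per
+ * block, a table covering every block of a 1TB cluster (2^27 blocks of
+ * 8kB each) converges to only about 16MB, so the per-block cost is
+ * modest; the pressure comes mainly from per-relation overhead when a
+ * cluster contains very many relations.
+ */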
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
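+
+/*
+ * For reference, the manifest fragments consumed via json_parse_manifest()
+ * above look roughly like this (abbreviated; values made up):
+ *
+ *   "Files": [
+ *     { "Path": "global/pg_control", "Size": 8192 }
+ *   ],
+ *   "WAL-Ranges": [
+ *     { "Timeline": 1, "Start-LSN": "0/2000028", "End-LSN": "0/9000060" }
+ *   ]
+ */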
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
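+ *
+ * For example (with hypothetical OIDs), segment 3 of a relation stored at
+ * "base/16384/16385" is sent as "base/16384/INCREMENTAL.16385.3", and
+ * segment 0 as "base/16384/INCREMENTAL.16385".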
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
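+ *
+ * For example, if this segment currently contains 100 blocks but the
+ * limit_block implies that it was truncated to 40 blocks past the start
+ * of the segment at some point after the prior backup, the truncation
+ * block length is 40: blocks 40..99 appear in the reconstructed file
+ * only if this backup includes them.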
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
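+ *
+ * As a worked example, with the default 8192-byte BLCKSZ and 4-byte
+ * BlockNumber, a two-block incremental file occupies
+ * 3 * 4 + 2 * (4 + 8192) = 16404 bytes.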
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
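+ /*
+ * appendStringInfoVA() returns 0 on success; otherwise it reports how
+ * much space is needed, so enlarge the buffer and retry until the
+ * message fits.
+ */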
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
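+ *
+ * For example (with hypothetical LSNs), summaries covering
+ * 0/1000000-0/2000000 and 0/2000000-0/3000000 prove completeness for the
+ * range 0/1000000 to 0/3000000; but if the second summary instead began
+ * at 0/2800000, *missing_lsn would be set to 0/2000000.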
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
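+ *
+ * Such names consist of 40 hexadecimal digits -- an 8-digit TLI followed
+ * by 16-digit start and end LSNs -- plus a ".summary" suffix, e.g. the
+ * hypothetical "0000000100000000010000280000000001FFE350.summary".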
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
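+
+ /* Enlarge the buffer and retry until the formatted message fits. */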
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
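+ *
+ * A sketch of the intended usage (result columns are illustrative):
+ *
+ * SELECT * FROM pg_available_wal_summaries();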
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
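+ *
+ * Intended usage is along these lines (hypothetical argument values):
+ *
+ * SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/1FFE350');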
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cb624eab8..86f6cf2feb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1833,6 +1837,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2664,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3019,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3138,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3547,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3703,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3731,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3829,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4051,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5400,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5540,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..34bd254183
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1414 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and LSN
+ * at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ *
+ * switch_requested can be set to true to notify the summarizer that a new
+ * WAL summary file should be written as soon as possible, without trying
+ * to read more WAL first.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+ bool switch_requested;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+ XLogRecPtr redo_pointer;
+ bool redo_pointer_reached;
+ XLogRecPtr redo_pointer_refresh_lsn;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * one minute (300 * 200 = 60 * 1000).
+ */
+#define MAX_SLEEP_QUANTA 300
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
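+/* Soft limit on the WAL covered by one summary file, in MB; 0 disables. */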
+int wal_summarize_mb = 256;
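+/* Retention time for summary files; 7 * 24 * 60 suggests minutes (one week). */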
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->switch_requested = false;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered by a
+ * single summary file. If we read a WAL record that ends after the
+ * cutoff LSN computed here, we'll stop the summary. In most cases, it
+ * will actually stop earlier than that, but this is here as a
+ * backstop.
+ */
+ cutoff_lsn = current_lsn + (uint64) wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && end_of_summary_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->switch_requested = false;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the newly initialized values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ * If it hasn't, but the in-memory value has reached the target value,
+ * request that a file be written as soon as possible.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (summarized_lsn < lsn &&
+ WalSummarizerCtl->pending_lsn >= lsn)
+ WalSummarizerCtl->switch_requested = true;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /*
+ * Limit the sleep to 1 second, because we may need to request a
+ * switch.
+ */
+ if (remaining_timeout > 1000)
+ remaining_timeout = 1000;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
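+ /*
+ * Exit if a shutdown was requested, or if WAL summarization has been
+ * disabled by setting wal_summarize_mb to 0.
+ */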
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = start_lsn;
+ private_data->redo_pointer_reached =
+ (start_lsn >= private_data->redo_pointer);
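+
+ /*
+ * redo_pointer_reached drives the heuristic, used in the main loop below,
+ * of ending summary files at checkpoint redo pointers: once the reader
+ * has passed the most recently observed redo pointer, we stop checking
+ * for it until our notion of the redo pointer is next refreshed.
+ */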
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool switch_requested;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /*
+ * We attempt, on a best effort basis only, to make WAL summary file
+ * boundaries line up with checkpoint cycles. So, if the last redo
+ * pointer we've seen was in the future, and this record starts at
+ * that redo pointer, stop before processing and let it be included in
+ * the next summary file.
+ *
+ * Note that in the case of a checkpoint triggered by a backup, the
+ * redo pointer is likely to be pointing to the first record on a
+ * page. Before reading the record, xlogreader->EndRecPtr will have
+ * pointed to the start of the page, which precedes the redo LSN. But
+ * after reading the next record, we'll advance over the page header
+ * and realize that the next record starts at the redo LSN exactly,
+ * making this the first point at which we can realize that it's time
+ * to stop.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->ReadRecPtr >= private_data->redo_pointer)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ default:
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /*
+ * Also update shared memory, and handle any request for a WAL summary
+ * file switch.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ switch_requested = WalSummarizerCtl->switch_requested;
+ LWLockRelease(WALSummarizerLock);
+ if (switch_requested)
+ break;
+
+ /*
+ * Periodically update our notion of the redo pointer, because it
+ * might be changing concurrently. There's no interlocking here: we
+ * might race past the new redo pointer before we learn about it.
+ * That's OK; we only use the redo pointer as a heuristic for where to
+ * stop summarizing.
+ *
+ * It would be nice if we could just fetch the updated redo pointer on
+ * every pass through this loop, but that seems a bit too expensive:
+ * GetRedoRecPtr acquires a heavily-contended spinlock. So, instead,
+ * just fetch the updated value if we've just had to sleep, or if
+ * we've read more than a segment's worth of WAL without sleeping.
+ */
+ if (private_data->waited || xlogreader->EndRecPtr >
+ private_data->redo_pointer_refresh_lsn + wal_segment_size)
+ {
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = xlogreader->EndRecPtr;
+ private_data->redo_pointer_reached =
+ (xlogreader->EndRecPtr >= private_data->redo_pointer);
+ }
+
+ /*
+ * Recheck whether we've just caught up with the redo pointer, and if
+ * so, stop. This has the same purpose as the earlier check for the
+ * same condition above, but there we've just read a record and might
+ * decide against including it in the current summary file, whereas
+ * here we've already included it and might decide against reading the
+ * next one. Note that we may have just refreshed our notion of the
+ * redo pointer, so it's smart to check here before we do any more
+ * work.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->EndRecPtr >= private_data->redo_pointer)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
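+
+ /*
+ * For example, a summary on timeline 1 covering 0/1000028 through
+ * 0/2000100 would be named
+ * 0000000100000000010000280000000002000100.summary.
+ */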
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
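+ *
+ * Commit and abort records can enumerate relation forks that are to be
+ * unlinked; once a relation's files are gone, any block modifications
+ * previously recorded for it are no longer interesting, so we reset the
+ * limit block for every fork (except the FSM) to 0.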
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * at least one full block available; read only that block, and have
+ * the caller come back if more is needed.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
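+
+ /*
+ * For example, if sleep_quanta is 8 and we read 4 pages since the last
+ * sleep, sleep_quanta drops to 4; reading 8 or more pages would instead
+ * reset it to the minimum of 1.
+ */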
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files that are older than wal_summarize_keep_time, but
+ * only bother to check once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
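+ * Note that wal_summarize_keep_time is expressed in minutes, hence the
+ * conversion to seconds here.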
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the summarized range of WAL no longer exists on disk, we can
+ * remove the summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether or not we removed the file, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
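+ *
+ * The manifest is transferred using the COPY protocol: we send the client
+ * a CopyInResponse, then absorb CopyData messages until CopyDone.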
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these, as we do elsewhere while receiving COPY data. */
+ return true;
+
+ case 'f': /* CopyFail */
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 182d666852..94e7944748 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4c58574166..faf42bdbfb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3181,6 +3184,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..4736606ac1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1a8cef345d..33416b11cf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
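+ /* bits: 1 = start LSN, 2 = start TLI, 4 = previous LSN, 8 = previous TLI */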
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
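+ *
+ * For example, line_starts_with(s, e, "START TIMELINE: ", &s) tests whether
+ * the line begins with that prefix and, on success, advances s past it.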
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
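+ * For example, "0/1C000028" yields the LSN 0x000000001C000028.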
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
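+ * For example, "1\n" yields TLI 1, while "1 \n" fails because the first
+ * character after the number is not a newline.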
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
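+ /*
+ * In this version of the patch, only Windows defines a special
+ * strategy; on other platforms we fall through to the block copy
+ * below.
+ */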
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
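+ * For example, a 25MB manifest produces an estimate of roughly 250,000
+ * hash table entries; the estimate is clamped below to 256 when the table
+ * is created.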
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..a6036dea74
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,35 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..32d2846433
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1270 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
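+/* e.g., an input file named "INCREMENTAL.16384" is reconstructed as "16384" */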
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+static cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
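+ * For example, if three directories are given, the first two are the
+ * prior backups, so n_prior_backups is 2.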
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
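+ *
+ * For example, -T /srv/old=/srv/new maps /srv/old to /srv/new, while a
+ * "\=" sequence in either path is copied as a literal equals sign.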
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
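+ *
+ * Concretely, each incremental backup's INCREMENTAL FROM TLI/LSN must match
+ * the START TIMELINE/WAL LOCATION of the backup immediately before it in
+ * the chain, and only the first backup may be a full backup.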
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ Oid oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
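+ *
+ * For example, with tsoid 16385 and relative_path "dir", a file "xyz" in
+ * this directory is looked up in the manifest as "pg_tblspc/16385/dir/xyz".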
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strncpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
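+ *
+ * For example, if truncation_block_length is 4 and the incremental file
+ * stores blocks 2 and 7, the reconstructed file must be 8 blocks long.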
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+	ssize_t		rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+		ssize_t		wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+			ssize_t		rb;
+
+			/* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
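
To make the reader logic above easier to review: an incremental file, as
consumed by make_incremental_rfile() and the offsetmap computation, is just a
fixed header (magic number, block count, truncation block length, and the
array of relative block numbers) followed by the stored blocks in the same
order. Here is a sketch of the offset arithmetic, not part of the patch,
assuming the patch's definitions of BlockNumber and BLCKSZ:

    /* Offset of the i-th stored block within an incremental file. */
    static off_t
    incremental_block_offset(unsigned num_blocks, unsigned i)
    {
        size_t      header_length = 3 * sizeof(uint32) +
            num_blocks * sizeof(BlockNumber);

        return (off_t) header_length + (off_t) i * BLCKSZ;
    }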
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif							/* RECONSTRUCT_H */
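
As a usage sketch only -- the variable names below are placeholders, and the
real call site is elsewhere in the patch series:

    int         checksum_length = 0;
    uint8      *checksum_payload = NULL;

    reconstruct_from_incremental_file(input_path, output_path,
                                      relative_path, bare_file_name,
                                      n_prior_backups, prior_backup_dirs,
                                      manifests, manifest_path,
                                      CHECKSUM_TYPE_SHA256,
                                      &checksum_length, &checksum_payload,
                                      dry_run);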
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..3d9238f366
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+	'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+	'-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
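
pg_combinebackup drives this API roughly as follows; this is just a sketch,
with the loop over output files and the variables feeding each call omitted:

    manifest_writer *mwriter = create_manifest_writer(output_directory);

    /* ... for each file written into the synthetic full backup ... */
    add_file_to_manifest(mwriter, manifest_path, size, mtime,
                         checksum_type, checksum_length, checksum_payload);

    /* ... once the last file has been added ... */
    finalize_manifest(mwriter, first_wal_range);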
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
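+	/* 40 = 8 hex digits of timeline ID plus 16 each for start and end LSN */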
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index 3c8effc533..2b41dd1839 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
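+ *
+ * For example, block 70000 of a given fork falls in chunk 1 (70000 / 65536)
+ * at offset 4464 (70000 % 65536). It is recorded either as the 2-byte value
+ * 4464 in that chunk's offset array or, once the chunk has been converted,
+ * as bit 4464 of its bitmap.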
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+	 * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
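+ *
+ * For example, passing start_blkno = 0 and stop_blkno = RELSEG_SIZE
+ * retrieves the modified blocks that fall within a relation's first
+ * segment file.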
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+		if (chunkno == stop_chunkno - 1)
+			stop_offset = Min(stop_blkno - (chunkno * BLOCKS_PER_CHUNK),
+							  BLOCKS_PER_CHUNK);
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+			for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
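+ *
+ * The resulting file layout is: the magic number; for each relation fork,
+ * a BlockRefTableSerializedEntry followed by that entry's chunk usage
+ * array and the contents of each nonempty chunk; an all-zeroes sentinel
+ * entry; and finally a CRC of everything that precedes it.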
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
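+ *
+ * A sketch of that workflow (my_write_cb and the rlocator, forknum,
+ * limit_block, and blknum values are placeholders):
+ *
+ *	writer = CreateBlockRefTableWriter(my_write_cb, arg);
+ *	entry = CreateBlockRefTableEntry(rlocator, forknum);
+ *	BlockRefTableEntrySetLimitBlock(entry, limit_block);
+ *	BlockRefTableEntryMarkBlockModified(entry, forknum, blknum);
+ *	BlockRefTableWriteEntry(writer, entry);
+ *	BlockRefTableFreeEntry(entry);
+ *	DestroyBlockRefTableWriter(writer);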
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
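+ *
+ * For example, if the WAL being summarized shows this fork truncated to
+ * 100 blocks, passing limit_block = 100 discards any recorded
+ * modifications to blocks 100 and above.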
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
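+ * (For example, with nchunks = 16 and chunkno = 40, max_chunks ends up
+ * as 64.)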
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
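+ * (A usage count equal to MAX_ENTRIES_PER_CHUNK is what marks a chunk
+ * as a bitmap, so an array must never be allowed to reach that count.)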
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index aa646f96a3..6348d60ec4 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4ad572cb87..9d1e4ab57b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c92d0631a0..9717c4630e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12071,4 +12071,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
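+ *
+ * For example, after a truncation to 100 blocks, the limit block number
+ * is 100: modifications to blocks 0..99 are tracked explicitly, while
+ * blocks 100 and above are implicitly treated as modified.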
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
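+
+/*
+ * Sketch of a conforming write callback; my_write_all is a placeholder
+ * that either writes all 'length' bytes (retrying short writes) or
+ * reports an error and does not return:
+ *
+ *	static int
+ *	my_write_cb(void *callback_arg, void *data, int length)
+ *	{
+ *		my_write_all(callback_arg, data, length);
+ *		return length;
+ *	}
+ */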
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7232b03e37..042fdc6ca1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..ad11be4664 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..4f52ddbe79 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 480e6d6caa..a91437dfa7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e69bb671bf..2ae238bf81 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3992,3 +3992,26 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v5-0001-Refactor-parse_filename_for_nontemp_relation-to-p.patch (application/octet-stream)
From 22ffb20913efc6d9ffbe77b78444aeb8ba54217b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:44 -0400
Subject: [PATCH v5 1/6] Refactor parse_filename_for_nontemp_relation to parse
more.
Instead of returning the number of characters in the RelFileNumber,
return the RelFileNumber itself. Continue to return the fork number,
as before, and additionally return the segment number.
parse_filename_for_nontemp_relation now rejects a RelFileNumber or
segment number that begins with a leading zero. Before, we accepted
such cases as relation filenames, but if we continued to do so after
this change, the function might return the same values for two
different files (e.g. 1234.5 and 001234.5 or 1234.005) which could be
annoying for callers. Since we don't actually ever generate filenames
with leading zeroes in the names, any such files that we find must
have been created by something other than PostgreSQL, and it is
therefore reasonable to treat them as non-relation files.
Along the way, change unlogged_relation_entry to store a RelFileNumber
rather than an OID. This update should have been made in
851f4cc75cdd8c831f1baa9a7abf8c8248b65890, but it was overlooked.
It's trivial to make the update as part of this commit, perhaps more
trivial than it would have been without it, so do that.
---
src/backend/backup/basebackup.c | 15 ++--
src/backend/storage/file/reinit.c | 137 ++++++++++++++++++------------
src/include/storage/reinit.h | 5 +-
3 files changed, 93 insertions(+), 64 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 7d025bcf38..b126d9c890 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1197,9 +1197,9 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
{
int excludeIdx;
bool excludeFound;
- ForkNumber relForkNum; /* Type of fork if file is a relation */
- int relnumchars; /* Chars in filename that are the
- * relnumber */
+ RelFileNumber relNumber;
+ ForkNumber relForkNum;
+ unsigned segno;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1249,23 +1249,20 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &relForkNum, &segno))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
{
char initForkFile[MAXPGPATH];
- char relNumber[OIDCHARS + 1];
/*
* If any other type of fork, check if there is an init fork
* with the same RelFileNumber. If so, the file can be
* excluded.
*/
- memcpy(relNumber, de->d_name, relnumchars);
- relNumber[relnumchars] = '\0';
- snprintf(initForkFile, sizeof(initForkFile), "%s/%s_init",
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relNumber);
if (lstat(initForkFile, &statbuf) == 0)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..5df2517b46 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
- Oid reloid; /* hash key */
+ RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
/*
@@ -195,12 +195,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -208,10 +209,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the RelFileNumber into the hash table, if it isn't already.
*/
- ent.reloid = atooid(de->d_name);
(void) hash_search(hash, &ent, HASH_ENTER, NULL);
}
@@ -235,12 +234,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* We never remove the init fork. */
@@ -251,7 +251,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
if (hash_search(hash, &ent, HASH_FIND, NULL))
{
snprintf(rm_path, sizeof(rm_path), "%s/%s",
@@ -285,14 +284,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ RelFileNumber relNumber;
+ unsigned segno;
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -304,11 +303,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspacedirname, de->d_name);
/* Construct destination pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(dstpath, sizeof(dstpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
/* OK, we're ready to perform the actual copy. */
elog(DEBUG2, "copying %s to %s", srcpath, dstpath);
@@ -327,14 +327,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
+ RelFileNumber relNumber;
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ unsigned segno;
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -342,11 +342,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/* Construct main fork pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(mainpath, sizeof(mainpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(mainpath, sizeof(mainpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
fsync_fname(mainpath, false);
}
@@ -371,52 +372,82 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* This function returns true if the file appears to be in the correct format
* for a non-temporary relation and false otherwise.
*
- * NB: If this function returns true, the caller is entitled to assume that
- * *relnumchars has been set to a value no more than OIDCHARS, and thus
- * that a buffer of OIDCHARS+1 characters is sufficient to hold the
- * RelFileNumber portion of the filename. This is critical to protect against
- * a possible buffer overrun.
+ * If it returns true, it sets *relnumber, *fork, and *segno to the values
+ * extracted from the filename. If it returns false, these values are set to
+ * InvalidRelFileNumber, InvalidForkNumber, and 0, respectively.
*/
bool
-parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
+ ForkNumber *fork, unsigned *segno)
{
- int pos;
+ unsigned long n,
+ s;
+ ForkNumber f;
+ char *endp;
- /* Look for a non-empty string of digits (that isn't too long). */
- for (pos = 0; isdigit((unsigned char) name[pos]); ++pos)
- ;
- if (pos == 0 || pos > OIDCHARS)
+ *relnumber = InvalidRelFileNumber;
+ *fork = InvalidForkNumber;
+ *segno = 0;
+
+ /*
+ * Relation filenames should begin with a digit that is not a zero. By
+ * rejecting cases involving leading zeroes, the caller can assume that
+ * there's only one possible string of characters that could have produced
+ * any given value for *relnumber.
+ *
+ * (To be clear, we don't expect files with names like 0017.3 to exist at
+ * all -- but if 0017.3 does exist, it's a non-relation file, not part of
+ * the main fork for relfilenode 17.)
+ */
+ if (name[0] < '1' || name[0] > '9')
+ return false;
+
+ /*
+ * Parse the leading digit string. If the value is out of range, we
+ * conclude that this isn't a relation file at all.
+ */
+ errno = 0;
+ n = strtoul(name, &endp, 10);
+ if (errno || name == endp || n <= 0 || n > PG_UINT32_MAX)
return false;
- *relnumchars = pos;
+ name = endp;
/* Check for a fork name. */
- if (name[pos] != '_')
- *fork = MAIN_FORKNUM;
+ if (*name != '_')
+ f = MAIN_FORKNUM;
else
{
int forkchar;
- forkchar = forkname_chars(&name[pos + 1], fork);
+ forkchar = forkname_chars(name + 1, &f);
if (forkchar <= 0)
return false;
- pos += forkchar + 1;
+ name += forkchar + 1;
}
/* Check for a segment number. */
- if (name[pos] == '.')
+ if (*name != '.')
+ s = 0;
+ else
{
- int segchar;
+ /* Reject leading zeroes, just like we do for RelFileNumber. */
+ if (name[1] < '1' || name[1] > '9')
+ return false;
- for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
- ;
- if (segchar <= 1)
+ errno = 0;
+ s = strtoul(name + 1, &endp, 10);
+ if (errno || name + 1 == endp || s <= 0 || s > PG_UINT32_MAX)
return false;
- pos += segchar;
+ name = endp;
}
/* Now we should be at the end. */
- if (name[pos] != '\0')
+ if (*name != '\0')
return false;
+
+ /* Set out parameters and return. */
+ *relnumber = (RelFileNumber) n;
+ *fork = f;
+ *segno = (unsigned) s;
return true;
}
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..f8eb7ce234 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,8 +20,9 @@
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
- int *relnumchars,
- ForkNumber *fork);
+ RelFileNumber *relnumber,
+ ForkNumber *fork,
+ unsigned *segno);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
--
2.37.1 (Apple Git-137.1)
On 10/19/23 12:05, Robert Haas wrote:
On Wed, Oct 4, 2023 at 4:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
Clearly there's a good amount of stuff to sort out here, but we've
still got quite a bit of time left before feature freeze so I'd like
to have a go at it. Please let me know your thoughts, if you have any.

Apparently, nobody has any thoughts, but here's an updated patch set
anyway. The main change, other than rebasing, is that I did a bunch
more documentation work on the main patch (0005). I'm much happier
with it now, although I expect it may need more adjustments here and
there as outstanding design questions get settled.

After some thought, I think that it should be fine to commit 0001 and
0002 as independent refactoring patches, and I plan to go ahead and do
that pretty soon unless somebody objects.
0001 looks pretty good to me. The only thing I find a little troublesome
is the repeated construction of file names with/without segment numbers
in ResetUnloggedRelationsInDbspaceDir(), e.g.:
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
If this happened three times I'd definitely want a helper function, but
even with two I think it would be a bit nicer.
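
For concreteness, such a helper might look something like this (just a
sketch; the name is made up):

static void
unlogged_relation_path(char *buf, size_t bufsz, const char *dbspacedirname,
                       RelFileNumber relNumber, unsigned segno)
{
    if (segno == 0)
        snprintf(buf, bufsz, "%s/%u", dbspacedirname, relNumber);
    else
        snprintf(buf, bufsz, "%s/%u.%u", dbspacedirname, relNumber, segno);
}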
0002 is definitely a good idea. FWIW pgBackRest does this conversion but
also errors if it does not succeed. We have never seen a report of this
error happening in the wild, so I think it must be pretty rare if it
does happen.
Regards,
-David
On Thu, Oct 19, 2023 at 3:18 PM David Steele <david@pgmasters.net> wrote:
0001 looks pretty good to me. The only thing I find a little troublesome
is the repeated construction of file names with/without segment numbers
in ResetUnloggedRelationsInDbspaceDir(), e.g.:

+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);

If this happened three times I'd definitely want a helper function, but
even with two I think it would be a bit nicer.
Personally I think that would make the code harder to read rather than
easier. I agree that repeating code isn't great, but this is a
relatively brief idiom and pretty self-explanatory. If other people
agree with you I can change it, but to me it's not an improvement.
0002 is definitely a good idea. FWIW pgBackRest does this conversion but
also errors if it does not succeed. We have never seen a report of this
error happening in the wild, so I think it must be pretty rare if it
does happen.
Cool, but ... how about the main patch set? It's nice to get some of
these refactoring bits and pieces out of the way, but if I spend the
effort to work out what I think are the right answers to the remaining
design questions for the main patch set and then find out after I've
done all that that you have massive objections, I'm going to be
annoyed. I've been trying to get this feature into PostgreSQL for
years, and if I don't succeed this time, I want the reason to be
something better than "well, I didn't find out that David disliked X
until five minutes before I was planning to type 'git push'."
I'm not really concerned about detailed bug-hunting in the main
patches just yet. The time for that will come. But if you have views
on how to resolve the design questions that I mentioned in a couple of
emails back, or intend to advocate vigorously against the whole
concept for some reason, let's try to sort that out sooner rather than
later.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Robert,
On Wed, Oct 4, 2023 at 10:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Oct 3, 2023 at 2:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
Here's a new patch set, also addressing Jakub's observation that
MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.

Here's yet another new version. [..]
Okay, so more good news - related to patch version #4.
A not-so-tiny stress test consisting of a pgbench run for 24h straight
(with incremental backups every 2h, based on the initial full backup),
followed by two PITRs (one not using incremental backup and one using
it, to illustrate the performance point - and potentially spot any
errors in between). In both cases it worked fine. Pgbench has this
behaviour that it doesn't cause space growth over time - it produces
lots of WAL instead. Some stats:
START DBSIZE: ~3.3GB (pgbench -i -s 200 --partitions=8)
END DBSIZE: ~4.3GB
RUN DURATION: 24h (pgbench -P 1 -R 100 -T 86400)
WALARCHIVES-24h: 77GB
FULL-DB-BACKUP-SIZE: 3.4GB
INCREMENTAL-BACKUP-11-SIZE: 3.5GB
Env: Azure VM D4s (4VCPU), Debian 11, gcc 10.2, normal build (asserts
and debug disabled)
The incremental backups were taken every 2h just to see if they would fail for
any reason - they did not.
PITR RTO RESULTS (copy/pg_combinebackup time + recovery time):
1. time to restore from fullbackup (+ recovery of 24h WAL[77GB]): 53s
+ 4640s =~ 78min
2. time to restore from fullbackup+incremental backup from 2h ago (+
recovery of 2h WAL [5.4GB]): 68s + 190s =~ 4min18s
I could probably pre-populate the DB with 1TB of cold data (not touched
by pgbench at all), just for the sake of argument, and that
would demonstrate how space-efficient the incremental backup can be,
but most of the time would be wasted on
copying the 1TB here...
- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week.
I've just noticed one thing while recovery was in progress: is
summarization working during recovery - in the background - an
expected behaviour? I'm wondering about that because, after the DB has
been freshly restored and recovered, one would need to create a *new*
full backup, and only from that point would new summaries be of any use?
Sample log:
2023-10-20 11:10:02.288 UTC [64434] LOG: restored log file
"000000010000000200000022" from archive
2023-10-20 11:10:02.599 UTC [64434] LOG: restored log file
"000000010000000200000023" from archive
2023-10-20 11:10:02.769 UTC [64446] LOG: summarized WAL on TLI 1 from
2/139B1130 to 2/239B1518
2023-10-20 11:10:02.923 UTC [64434] LOG: restored log file
"000000010000000200000024" from archive
2023-10-20 11:10:03.193 UTC [64434] LOG: restored log file
"000000010000000200000025" from archive
2023-10-20 11:10:03.345 UTC [64432] LOG: restartpoint starting: wal
2023-10-20 11:10:03.407 UTC [64446] LOG: summarized WAL on TLI 1 from
2/239B1518 to 2/25B609D0
2023-10-20 11:10:03.521 UTC [64434] LOG: restored log file
"000000010000000200000026" from archive
2023-10-20 11:10:04.429 UTC [64434] LOG: restored log file
"000000010000000200000027" from archive
- On a related note, I haven't yet tested this on a standby, which is
a thing that I definitely need to do. I don't know of a reason why it
shouldn't be possible for all of this machinery to work on a standby
just as it does on a primary, but then we need the WAL summarizer to
run there too, which could end up being a waste if nobody ever tries
to take an incremental backup. I wonder how that should be reflected
in the configuration. We could do something like what we've done for
archive_mode, where on means "only on if this is a primary" and you
have to say always if you want it to run on standbys as well ... but
I'm not sure if that's a design pattern that we really want to
replicate into more places. I'd be somewhat inclined to just make
whatever configuration parameters we need to configure this thing on
the primary also work on standbys, and you can set each server up as
you please. But I'm open to other suggestions.
I'll try to play with some standby restores in the future, stay tuned.
Regards,
-J.
On 10/19/23 16:00, Robert Haas wrote:
On Thu, Oct 19, 2023 at 3:18 PM David Steele <david@pgmasters.net> wrote:
0001 looks pretty good to me. The only thing I find a little troublesome
is the repeated construction of file names with/without segment numbers
in ResetUnloggedRelationsInDbspaceDir(), e.g.:

+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);

If this happened three times I'd definitely want a helper function, but
even with two I think it would be a bit nicer.

Personally I think that would make the code harder to read rather than
easier. I agree that repeating code isn't great, but this is a
relatively brief idiom and pretty self-explanatory. If other people
agree with you I can change it, but to me it's not an improvement.
Then I'm fine with it as is.
0002 is definitely a good idea. FWIW pgBackRest does this conversion but
also errors if it does not succeed. We have never seen a report of this
error happening in the wild, so I think it must be pretty rare if it
does happen.

Cool, but ... how about the main patch set? It's nice to get some of
these refactoring bits and pieces out of the way, but if I spend the
effort to work out what I think are the right answers to the remaining
design questions for the main patch set and then find out after I've
done all that that you have massive objections, I'm going to be
annoyed. I've been trying to get this feature into PostgreSQL for
years, and if I don't succeed this time, I want the reason to be
something better than "well, I didn't find out that David disliked X
until five minutes before I was planning to type 'git push'."
I simply have not had time to look at the main patch set in any detail.
I'm not really concerned about detailed bug-hunting in the main
patches just yet. The time for that will come. But if you have views
on how to resolve the design questions that I mentioned in a couple of
emails back, or intend to advocate vigorously against the whole
concept for some reason, let's try to sort that out sooner rather than
later.
In my view this feature puts the cart way before the horse. I'd think
higher priority features might be parallelism, a backup repository,
expiration management, archiving, or maybe even a restore command.
It seems the only goal here is to make pg_basebackup a tool for external
backup software to use, which might be OK, but I don't believe this
feature really advances pg_basebackup as a usable piece of stand-alone
software. If people really think that start/stop backup is too
complicated an interface, how are they supposed to track page
incrementals and get them to a place where pg_combinebackup can put them
back together? If automation is required to use this feature,
shouldn't pg_basebackup implement that automation?
I have plenty of thoughts about the implementation as well, but I have a
lot on my plate right now and I don't have time to get into it.
I don't plan to stand in your way on this feature. I'm reviewing what
patches I can out of courtesy and to be sure that nothing adjacent to
your work is being affected. My apologies if my reviews are not meeting
your expectations, but I am contributing as my time constraints allow.
Regards,
-David
On Fri, Oct 20, 2023 at 11:30 AM David Steele <david@pgmasters.net> wrote:
Then I'm fine with it as is.
OK, thanks.
In my view this feature puts the cart way before the horse. I'd think
higher priority features might be parallelism, a backup repository,
expiration management, archiving, or maybe even a restore command.

It seems the only goal here is to make pg_basebackup a tool for external
backup software to use, which might be OK, but I don't believe this
feature really advances pg_basebackup as a usable piece of stand-alone
software. If people really think that start/stop backup is too
complicated an interface, how are they supposed to track page
incrementals and get them to a place where pg_combinebackup can put them
back together? If automation is required to use this feature,
shouldn't pg_basebackup implement that automation?

I have plenty of thoughts about the implementation as well, but I have a
lot on my plate right now and I don't have time to get into it.

I don't plan to stand in your way on this feature. I'm reviewing what
patches I can out of courtesy and to be sure that nothing adjacent to
your work is being affected. My apologies if my reviews are not meeting
your expectations, but I am contributing as my time constraints allow.
Sorry, I realize reading this response that I probably didn't do a
very good job writing that email and came across sounding like a jerk.
Possibly, I actually am a jerk. Whether it just sounded like it or I
actually am, I apologize. But your last paragraph here gets at my real
question, which is whether you were going to try to block the feature.
I recognize that we have different priorities when it comes to what
would make most sense to implement in PostgreSQL, and that's OK, or at
least, it's OK with me. I also don't have any particular expectation
about how much you should review the patch or in what level of detail,
and I have sincerely appreciated your feedback thus far. If you are
able to continue to provide more, that's great, and if that's not,
well, you're not obligated. What I was concerned about was whether
that review was a precursor to a vigorous attempt to keep the main
patch from getting committed, because if that was going to be the
case, then I'd like to surface that conflict sooner rather than later.
It sounds like that's not an issue, which is great.
At the risk of drifting into the fraught question of what I *should*
be implementing rather than the hopefully-less-fraught question of
whether what I am actually implementing is any good, I see incremental
backup as a way of removing some of the use cases for the low-level
backup API. If you said "but people still will have lots of reasons to
use it," I would agree; and if you said "people can still screw things
up with pg_basebackup," I would also agree. Nonetheless, most of the
disasters I've personally seen have stemmed from the use of the
low-level API rather than from the use of pg_basebackup, though there
are exceptions. I also think a lot of the use of the low-level API is
driven by it being just too darn slow to copy the whole database, and
incremental backup can help with that in some circumstances. Also, I
have worked fairly hard to try to make sure that if you misuse
pg_combinebackup, or fail to use it altogether, you'll get an error
rather than silent data corruption. I would be interested to hear
about scenarios where the checks that I've implemented can be defeated
by something that is plausibly described as stupidity rather than
malice. I'm not sure we can fix all such cases, but I'm very alert to
the horror that will befall me if user error looks exactly like a bug
in the code. For my own sanity, we have to be able to distinguish
those cases. Moreover, we also need to be able to distinguish
backup-time bugs from reassembly-time bugs, which is why I've got the
pg_walsummary tool, and why pg_combinebackup has the ability to emit
fairly detailed debugging output. I anticipate those things being
useful in investigating bug reports when they show up. I won't be too
surprised if it turns out that more work on sanity-checking and/or
debugging tools is needed, but I think your concern about people
misusing stuff is bang on target and I really want to do whatever we
can to avoid that when possible and detect it when it happens.
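For example, I'd expect an investigation to start with something along
these lines (the file name and output values here are invented; the
output format is what the pg_walsummary patch below prints):

pg_walsummary pg_wal/summaries/<summary file>
TS 1663, DB 16384, REL 16385, FORK main: limit 0
TS 1663, DB 16384, REL 16385, FORK main: blocks 0..127

and, on the reassembly side, re-running with the debug output enabled:

pg_combinebackup -d /backups/full /backups/incr -o /tmp/restored 2>debug.log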
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Oct 20, 2023 at 9:20 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Okay, so another good news - related to the patch version #4.
Not-so-tiny stress test consisting of a pgbench run for 24h straight
(with incremental backups every 2h, based on the initial full backup),
followed by two PITRs (one not using incremental backup and one using
it, to illustrate the performance point - and potentially spot any
errors in between). In both cases it worked fine.
This is great testing, thanks. What might be even better is to test
whether the resulting backups are correct, somehow.
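One approach (a sketch only; it assumes the reconstructed cluster can be
started on a scratch port, and the paths and ports are invented) would
be:

# check the reconstructed directory against its manifest, skipping WAL:
pg_verifybackup -n /backups/combined
# then start it, let recovery finish, and compare logical dumps:
pg_dump -p 5432 pgbench > source.sql
pg_dump -p 5433 pgbench > restored.sql
diff source.sql restored.sql

The t/002_compare_backups.pl test in 0003 does something along these
lines already.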
I've just noticed one thing while recovery is in progress: is
summarization working during recovery - in the background - expected
behaviour? I'm wondering about that because, after the DB is freshly
restored and recovered, one would need to create a *new* full backup,
and only from that point would new summaries be of any use?
Actually, I think you could take an incremental backup relative to a
full backup from a previous timeline.
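That is, something along these lines should work (paths invented for
illustration):

pg_basebackup -D after_pitr_incr --incremental=/backups/old_full/backup_manifest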
But the question of what summarization ought to do (or not do) during
recovery, and whether it ought to be enabled by default, and what the
retention policy ought to be are very much live ones. Right now, it's
enabled by default and keeps summaries for a week, assuming you don't
reset your local clock and that it advances at the same speed as the
universe's own clock. But that's all debatable. Any views?
Meanwhile, here's a new patch set. I went ahead and committed the
first two preparatory patches, as I said earlier that I intended to
do. And here I've adjusted the main patch, which is now 0003, for the
addition of XLOG_CHECKPOINT_REDO, which permitted me to simplify a few
things. wal_summarize_mb now feels like a bit of a silly GUC --
presumably you'd never care, unless you had an absolutely gigantic
inter-checkpoint WAL distance. And if you have that, maybe you should
also have enough memory to summarize all that WAL. Or maybe not:
perhaps it's better to write WAL summaries more than once per
checkpoint when checkpoints are really big. But I'm worried that the
GUC will become a source of needless confusion for users. For most
people, it seems like emitting one summary per checkpoint should be
totally fine, and they might prefer a simple Boolean GUC,
summarize_wal = true | false, over this. I'm just not quite sure about
the corner cases.
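To make that concrete, the simpler interface could look like this in
postgresql.conf (a sketch; summarize_wal is only a proposal at this
point, and wal_summarize_keep_time is the WIP patch's retention GUC, so
both spellings are subject to change):

# simple on/off switch instead of wal_summarize_mb:
summarize_wal = true
# how long to keep files under pg_wal/summaries:
wal_summarize_keep_time = '7d'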
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v6-0004-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From c66a2ab3cbee191f1ba0d97994b8a7a8e0086c68 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:39 -0400
Subject: [PATCH v6 4/4] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs documentation and tests.
---
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
6 files changed, 347 insertions(+)
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ /* Ensure the option flags start out false. */
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
--
2.37.1 (Apple Git-137.1)
v6-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch (application/octet-stream)
From fac0a392b62254066300c051b077ef78a9d4cbcb Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v6 2/4] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index 70884be00c..3c8effc533 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index ae05ac63cf..aa646f96a3 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index f0acd9f1e7..9895f2f17d 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v6-0001-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From 621ea9af483466cbf08cbcca10a4650c2518f235 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v6 1/4] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
v6-0003-Prototype-patch-for-incremental-backup.patch (application/octet-stream)
From a381fdbb31ba2752f89b64dd46506fb530cc0355 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v6 3/4] Prototype patch for incremental backup.
We don't differentiate between incremental and differential backups;
an incremental backup can be based either on a full backup or on a
previous incremental backup.
This adds a new background process, the WAL summarizer, whose behavior
is governed by new GUCs wal_summarize_mb and wal_summarize_keep_time.
This writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block" which is
0 if a relation is created or destroyed within a certain range of WAL
records, or otherwise the shortest length to which the relation was
truncated during that range of WAL records, or otherwise
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied in case of an incremental backup covering that
range of WAL records.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
Open issues:
- Should we remove wal_summarize_mb, or replace it with a Boolean
on/off switch?
- How should we control generation and retention of summary files?
What should be the defaults?
- Needs to be tested on a standby.
- Should we send the whole backup manifest to the server or, say,
just an LSN?
- Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
- It would be nice (but not essential) to do something about incremental
JSON parsing.
- Might need more tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlog.c | 100 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 334 +++-
src/backend/backup/basebackup_incremental.c | 873 +++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1363 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 35 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1270 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 ++++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 23 +
72 files changed, 8937 insertions(+), 70 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's simpler to ignore the existence
+ of incremental backups and simply take full backups, which are simpler
+ to manage. For a large database all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 712568a62d..50536d0521 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method=<replaceable class="parameter">method</replaceable></option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. The search for files will follow
+ symbolic links for the WAL directory and each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory.
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40461923ea..9ddad7864f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3555,6 +3556,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3834,8 +3872,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter two do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3878,6 +3916,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5202,9 +5260,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6921,6 +6979,24 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably promptly:
+ * we've only just inserted and flushed the XLOG_CHECKPOINT_SHUTDOWN
+ * record. If this is not a shutdown checkpoint, then this might not be
+ * very prompt at all: the XLOG_CHECKPOINT_REDO record was written before
+ * we began flushing data to disk, and that could be many minutes ago at
+ * this point. However, we don't XLogFlush() after inserting that record,
+ * so we're not guaranteed that it's on disk until after the above call
+ * that flushes the XLOG_CHECKPOINT_ONLINE record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7595,6 +7671,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 315e4b27cb..6cde31ee23 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
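+ * (With default build settings that is sizeof(BlockNumber) * RELSEG_SIZE
+ * = 4 * 131072 bytes, i.e. 512kB.)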
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
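+/*
+ * Illustrative sketch only (not used anywhere): how a caller is expected
+ * to drive push_to_sink(), including the final flush that this function
+ * deliberately leaves to the caller. This mirrors what sendFile() does
+ * above when emitting an incremental file header; the function and
+ * variable names here are hypothetical.
+ */
+#ifdef NOT_USED
+static void
+push_to_sink_usage_example(bbsink *sink, pg_checksum_context *checksum_ctx)
+{
+ uint32 a = 1;
+ uint32 b = 2;
+ size_t bytes_done = 0;
+
+ /* Copy two small quantities into the sink's buffer. */
+ push_to_sink(sink, checksum_ctx, &bytes_done, &a, sizeof(a));
+ push_to_sink(sink, checksum_ctx, &bytes_done, &b, sizeof(b));
+
+ /* The caller is responsible for the final flush. */
+ if (bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, bytes_done);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ bytes_done) < 0)
+ elog(ERROR, "could not update checksum");
+ }
+}
+#endif
+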
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..20cc00bded
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,873 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
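+ *
+ * As a purely illustrative example: if this server's history is
+ * TLI 1 -> 2 -> 3, readTimeLineHistory() returns entries in the order
+ * {3, 2, 1}, so a manifest containing WAL ranges for TLIs 2 and 3 would
+ * produce earliest_wal_range_tli = 2 and latest_wal_range_tli = 3.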
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
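+ *
+ * For example (illustrative values only): segment 2 of a relation stored
+ * at "base/5/16384" is sent as "base/5/INCREMENTAL.16384.2", while
+ * segment 0 is sent as "base/5/INCREMENTAL.16384".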
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
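+ *
+ * For instance, with the default 8kB block size and 1GB segments, a
+ * segment has up to 131072 blocks, so anything above 117964 modified
+ * blocks (90%) causes the whole file to be sent. (Illustrative numbers;
+ * the actual limits follow BLCKSZ and RELSEG_SIZE.)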
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length) followed by block numbers followed by block contents.
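+ *
+ * For example (illustrative arithmetic, default 8kB blocks): a file
+ * holding 10 incremental blocks occupies 12 + 10 * (4 + 8192) = 81972
+ * bytes.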
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
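+
+/*
+ * Purely illustrative sketch (not part of the backend code path): the
+ * on-disk layout implied by GetIncrementalFileSize() and produced by
+ * sendFile() in basebackup.c, expressed as a minimal reader. The function
+ * name is hypothetical, error handling and short reads are deliberately
+ * glossed over, and read() is assumed to be available via <unistd.h>.
+ */
+#ifdef NOT_USED
+static void
+read_incremental_file_header_example(int fd)
+{
+ uint32 magic;
+ uint32 num_blocks;
+ uint32 truncation_block_length;
+ BlockNumber *blocknos;
+
+ /* Three four-byte quantities come first... */
+ read(fd, &magic, sizeof(uint32));
+ read(fd, &num_blocks, sizeof(uint32));
+ read(fd, &truncation_block_length, sizeof(uint32));
+
+ /* ...then the block numbers; the actual block contents follow them. */
+ blocknos = palloc(sizeof(BlockNumber) * num_blocks);
+ read(fd, blocknos, sizeof(BlockNumber) * num_blocks);
+}
+#endif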
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
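+
+ /*
+ * The name packs the TLI and the start and end LSNs into five 8-digit
+ * hex fields; e.g., the (illustrative) name
+ * 0000000100000001280000000000000128A00000.summary denotes a summary
+ * on TLI 1 covering LSNs 1/28000000 up to 1/28A00000.
+ */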
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
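+ *
+ * For example (LSNs illustrative), summaries covering 0/1000-0/2000 and
+ * 0/1800-0/3000 together prove the range 0/1000-0/3000 complete, but if
+ * the second summary instead began at 0/2800, we would stop and report
+ * *missing_lsn = 0/2000.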
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
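+ *
+ * For example: SELECT * FROM pg_available_wal_summaries();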
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
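+ *
+ * For example (the LSN arguments here are illustrative):
+ * SELECT * FROM pg_wal_summary_contents(1, '0/28000000', '0/28A00000');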
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cb624eab8..86f6cf2feb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1833,6 +1837,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2664,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3019,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3138,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3547,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3703,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3731,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3829,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4051,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5400,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5540,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..4ded951119
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1363 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and LSN
+ * at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but seems
+ * reasonable to treat like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered by a
+ * single summary file. If we read a WAL record that ends after the
+ * cutoff LSN computed here, we'll stop the summary. In most cases, it
+ * will actually stop earlier than that, but this is here as a
+ * backstop.
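+ *
+ * For example, with wal_summarize_mb = 256, a summary that begins at
+ * 0/28000000 is cut off no later than the first record ending at or
+ * beyond 0/38000000.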
+ */
+ cutoff_lsn = current_lsn + wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if it's
+ * just the beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the values to the caller as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file,
+ * do so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
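+ /*
+ * A commit record may carry a list of relations that are unlinked at
+ * commit. Zeroing the limit block for every fork means that a
+ * relation which later reuses one of these locators is treated as
+ * new in its entirety.
+ */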
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like summarization
+ * to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to reading from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files whose modification times are older than
+ * wal_summarize_keep_time, at most once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
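+
+ /*
+ * If no WAL files survive on this timeline at all, oldest_lsn stays
+ * invalid, and every summary file for the timeline becomes a removal
+ * candidate below.
+ */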
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the summary
+ * file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether or not we removed the file, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
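+ *
+ * The manifest is streamed from the client as COPY-in data; a libpq
+ * client is expected to do, roughly:
+ *
+ *     PQsendQuery(conn, "UPLOAD_MANIFEST");
+ *     ... one PQputCopyData() per chunk of manifest data ...
+ *     PQputCopyEnd(conn, NULL);
+ *
+ * which is what the pg_basebackup changes later in this patch do.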
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
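+ /*
+ * Hold cancel interrupts while reading the message, so that a query
+ * cancel arriving mid-message cannot leave us out of step with the
+ * protocol stream.
+ */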
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in COPY mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 182d666852..94e7944748 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4c58574166..faf42bdbfb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3181,6 +3184,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..4736606ac1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1a8cef345d..33416b11cf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* We are done with the manifest file. */
+ close(fd);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+/pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
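+ /*
+ * Bits of 'found': 1 = start LSN, 2 = start TLI, 4 = previous LSN,
+ * 8 = previous TLI.
+ */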
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
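+ /* Copy the label line by line, omitting the INCREMENTAL FROM lines. */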
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
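+ /*
+ * Temporarily NUL-terminate the line at e so that sscanf cannot scan
+ * past the end of the data; the saved byte is restored below.
+ */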
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
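+ /* 'offset' is tracked only for use in error messages. */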
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
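+/*
+ * Instantiate a simplehash hash table keyed by pathname. With these
+ * definitions, lib/simplehash.h generates manifest_files_hash along with
+ * functions such as manifest_files_insert and manifest_files_lookup that
+ * are used elsewhere in this patch.
+ */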
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..a6036dea74
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,35 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..32d2846433
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1270 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
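+
+/*
+ * For example, an incremental version of relation file "16384" is stored in
+ * the same directory as "INCREMENTAL.16384".
+ */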
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
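+ *
+ * (An in-place tablespace is one whose pg_tblspc entry is a plain directory
+ * rather than a symbolic link to a directory elsewhere.)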
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+static cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
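+ * For example, if the command line names backups "full incr1 incr2", the
+ * prior backups are "full" and "incr1".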
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version * 10000, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
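+ *
+ * For example, given -T /srv/old\=ts=/srv/new, the old directory is
+ * "/srv/old=ts" and the new directory is "/srv/new".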
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
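+ *
+ * As enforced here, "coherent" means that each incremental backup's recorded
+ * previous TLI/LSN must match the start TLI/LSN of the backup just before it
+ * on the command line, the first backup must be a full backup, and all of
+ * the others must be incremental backups.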
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
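+ *
+ * For example, files in a tablespace with OID 16385 are looked up in the
+ * manifest under "pg_tblspc/16385/...".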
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
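+ *
+ * If checksum_type is not CHECKSUM_TYPE_NONE, a checksum of the output file
+ * is computed (or, where possible, reused from a backup_manifest) and
+ * returned via *checksum_length and *checksum_payload.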
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
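+ *
+ * For example, if output block 3 is stored as the second block of some
+ * incremental file, then sourcemap[3] points to that file's rfile and
+ * offsetmap[3] is that file's header_length + 1 * BLCKSZ.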
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
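+ *
+ * As read below, the header consists of a magic number, a block count, a
+ * truncation block length, and an array of relative block numbers; the
+ * block data itself follows the header.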
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ int rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
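+
+/*
+ * To restate the contract of write_reconstructed_file for reviewers:
+ * for each block i of the output, sourcemap[i] identifies the rfile
+ * supplying that block and offsetmap[i] the byte offset within it, with
+ * a NULL source meaning "zero-fill". A hypothetical three-block example,
+ * where blocks 0 and 2 come from an incremental file and block 1 from
+ * the prior full backup:
+ *
+ *     sourcemap[0] = incr;  offsetmap[0] = incr->header_length;
+ *     sourcemap[1] = full;  offsetmap[1] = 1 * BLCKSZ;
+ *     sourcemap[2] = incr;  offsetmap[2] = incr->header_length + BLCKSZ;
+ *
+ * An incremental file stores only the blocks it mentions, in the order
+ * its header lists them, which is why those offsets are relative to
+ * header_length.
+ */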
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..3d9238f366
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the final portion of the manifest, including the checksum. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
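+
+/*
+ * Usage sketch (for review; not called in this file): a caller that
+ * already has each file's size, mtime, and checksum payload in hand --
+ * the variable names below are made up -- drives the API like this:
+ *
+ *     manifest_writer *mwriter = create_manifest_writer(output_directory);
+ *
+ *     for (i = 0; i < nfiles; ++i)
+ *         add_file_to_manifest(mwriter, files[i].manifest_path,
+ *                              files[i].size, files[i].mtime,
+ *                              files[i].checksum_type,
+ *                              files[i].checksum_length,
+ *                              files[i].checksum_payload);
+ *
+ *     finalize_manifest(mwriter, first_wal_range);
+ *
+ * The output file is created lazily on the first flush, and the SHA-256
+ * covers everything up to but not including the Manifest-Checksum key,
+ * matching what the server produces for an ordinary backup manifest.
+ */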
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
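+
+/*
+ * Note for reviewers: WALSUMMARY_NHEXCHARS is 40 because a summary file
+ * name is an 8-character TLI followed by two 16-character LSNs, so the
+ * names matched above are shaped like this made-up example:
+ *
+ *     0000000100000000015125B8000000000153C5B0.summary
+ */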
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index 3c8effc533..2b41dd1839 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
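+
+/*
+ * To make the arithmetic concrete (this just restates what the code
+ * below does): for a given BlockNumber blknum,
+ *
+ *     chunkno = blknum / BLOCKS_PER_CHUNK;      -- which chunk
+ *     chunkoffset = blknum % BLOCKS_PER_CHUNK;  -- 0 .. 65535
+ *
+ * A chunk whose usage count reaches MAX_ENTRIES_PER_CHUNK (4096 uint16s
+ * = 8kB = 65536 bits) is interpreted as a bitmap; any smaller usage
+ * count means an array of 2-byte chunkoffset values, stored in
+ * insertion order.
+ */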
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * table reference file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1 && (stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
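+
+/*
+ * Usage sketch (not called anywhere in this file): to pull out every
+ * modified block falling within one 1GB segment -- the granularity the
+ * incremental backup code cares about -- given an 'entry' obtained from
+ * BlockRefTableGetEntry and a hypothetical qsort comparator:
+ *
+ *     BlockNumber *blocks = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
+ *     int nblocks;
+ *
+ *     nblocks = BlockRefTableEntryGetBlocks(entry,
+ *                                           segno * RELSEG_SIZE,
+ *                                           (segno + 1) * RELSEG_SIZE,
+ *                                           blocks, RELSEG_SIZE);
+ *     qsort(blocks, nblocks, sizeof(BlockNumber), compare_block_numbers);
+ *
+ * Note that chunks still in array format return their offsets in
+ * insertion order, so a caller that wants ascending block numbers must
+ * sort the result.
+ */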
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
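+
+/*
+ * The I/O callbacks are deliberately minimal. Going by the call sites in
+ * this file, io_callback_fn is essentially
+ * int (*)(void *arg, void *data, int length), so a file-backed write
+ * callback can be a thin wrapper around write(). A sketch, with the
+ * error handling simplified:
+ *
+ *     static int
+ *     write_callback(void *arg, void *data, int length)
+ *     {
+ *         int fd = *(int *) arg;
+ *         ssize_t nwritten = write(fd, data, length);
+ *
+ *         if (nwritten != length)
+ *             pg_fatal("could not write block reference table file: %m");
+ *         return (int) nwritten;
+ *     }
+ *
+ * and then: WriteBlockRefTable(brtab, write_callback, &fd);
+ */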
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
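+
+/*
+ * Putting the reader pieces together, a consumer loop looks something
+ * like this (a sketch; read_callback mirrors the write-side callback and
+ * report_error is whatever the caller uses to complain about a malformed
+ * file):
+ *
+ *     RelFileLocator rlocator;
+ *     ForkNumber forknum;
+ *     BlockNumber limit_block, blocks[256];
+ *     unsigned n;
+ *
+ *     reader = CreateBlockRefTableReader(read_callback, &fd, filename,
+ *                                        report_error, NULL);
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator,
+ *                                            &forknum, &limit_block))
+ *     {
+ *         -- must drain all blocks before the next relation --
+ *         while ((n = BlockRefTableReaderGetBlocks(reader, blocks, 256)) > 0)
+ *             ... consume blocks[0 .. n-1] ...
+ *     }
+ *     DestroyBlockRefTableReader(reader);
+ */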
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
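+
+/*
+ * And the corresponding incremental-write path, for callers that can
+ * produce entries already sorted the way BlockRefTableWriteEntry demands
+ * (a sketch; sorted_entries is hypothetical):
+ *
+ *     writer = CreateBlockRefTableWriter(write_callback, &fd);
+ *     for (i = 0; i < nentries; ++i)
+ *         BlockRefTableWriteEntry(writer, sorted_entries[i]);
+ *     DestroyBlockRefTableWriter(writer);   -- writes sentinel + CRC
+ */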
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index aa646f96a3..6348d60ec4 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4ad572cb87..9d1e4ab57b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c92d0631a0..9717c4630e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12071,4 +12071,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7232b03e37..042fdc6ca1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..ad11be4664 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..4f52ddbe79 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..5fe4faf1be 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 06b25617bc..8bccec66c3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3993,3 +3993,26 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
On 10/23/23 11:44, Robert Haas wrote:
On Fri, Oct 20, 2023 at 11:30 AM David Steele <david@pgmasters.net> wrote:
I don't plan to stand in your way on this feature. I'm reviewing what
patches I can out of courtesy and to be sure that nothing adjacent to
your work is being affected. My apologies if my reviews are not meeting
your expectations, but I am contributing as my time constraints allow.

Sorry, I realize reading this response that I probably didn't do a
very good job writing that email and came across sounding like a jerk.
Possibly, I actually am a jerk. Whether it just sounded like it or I
actually am, I apologize.
That was the way it came across, though I prefer to think it was
unintentional. I certainly understand how frustrating dealing with a
large and uncertain patch can be. Either way, I appreciate the apology.
Now onward...
But your last paragraph here gets at my real
question, which is whether you were going to try to block the feature.
I recognize that we have different priorities when it comes to what
would make most sense to implement in PostgreSQL, and that's OK, or at
least, it's OK with me.
This seems perfectly natural to me.
I also don't have any particular expectation
about how much you should review the patch or in what level of detail,
and I have sincerely appreciated your feedback thus far. If you are
able to continue to provide more, that's great, and if that's not,
well, you're not obligated. What I was concerned about was whether
that review was a precursor to a vigorous attempt to keep the main
patch from getting committed, because if that was going to be the
case, then I'd like to surface that conflict sooner rather than later.
It sounds like that's not an issue, which is great.
Overall I would say I'm not strongly for or against the patch. I think
it will be very difficult to use in a manual fashion, but automation is
the way to go in general, so that's not necessarily an argument against.
However, this is an area of great interest to me so I do want to at
least make sure nothing is being impacted adjacent to the main goal of
this patch. So far I have seen no sign of that, but that has been a
primary goal of my reviews.
At the risk of drifting into the fraught question of what I *should*
be implementing rather than the hopefully-less-fraught question of
whether what I am actually implementing is any good, I see incremental
backup as a way of removing some of the use cases for the low-level
backup API. If you said "but people still will have lots of reasons to
use it," I would agree; and if you said "people can still screw things
up with pg_basebackup," I would also agree. Nonetheless, most of the
disasters I've personally seen have stemmed from the use of the
low-level API rather than from the use of pg_basebackup, though there
are exceptions.
This all makes sense to me.
I also think a lot of the use of the low-level API is
driven by it being just too darn slow to copy the whole database, and
incremental backup can help with that in some circumstances.
I would argue that restore performance is *more* important than backup
performance and this patch is a step backward in that regard. Backups
will be faster and less space will be used in the repository, but
restore performance is going to suffer. If the deltas are very small the
difference will probably be negligible, but as the deltas get large (and
especially if there are a lot of them) the penalty will be more noticeable.
Also, I
have worked fairly hard to try to make sure that if you misuse
pg_combinebackup, or fail to use it altogether, you'll get an error
rather than silent data corruption. I would be interested to hear
about scenarios where the checks that I've implemented can be defeated
by something that is plausibly described as stupidity rather than
malice. I'm not sure we can fix all such cases, but I'm very alert to
the horror that will befall me if user error looks exactly like a bug
in the code. For my own sanity, we have to be able to distinguish
those cases.
I was concerned with the difficulty of trying to stage the correct
backups for pg_combinebackup, not whether it would recognize that the
needed data was not available and then error appropriately. The latter
is surmountable within pg_combinebackup but the former is left up to the
user.
Moreover, we also need to be able to distinguish
backup-time bugs from reassembly-time bugs, which is why I've got the
pg_walsummary tool, and why pg_combinebackup has the ability to emit
fairly detailed debugging output. I anticipate those things being
useful in investigating bug reports when they show up. I won't be too
surprised if it turns out that more work on sanity-checking and/or
debugging tools is needed, but I think your concern about people
misusing stuff is bang on target and I really want to do whatever we
can to avoid that when possible and detect it when it happens.
The ability of users to misuse tools is, of course, legendary, so that
all sounds good to me.
One note regarding the patches. I feel like
v5-0005-Prototype-patch-for-incremental-backup should be split to have
the WAL summarizer as one patch and the changes to base backup as a
separate patch.
It might not be useful to commit one without the other, but it would
make for an easier read. Just my 2c.
Regards,
-David
On Mon, Oct 23, 2023 at 7:56 PM David Steele <david@pgmasters.net> wrote:
I also think a lot of the use of the low-level API is
driven by it being just too darn slow to copy the whole database, and
incremental backup can help with that in some circumstances.

I would argue that restore performance is *more* important than backup
performance and this patch is a step backward in that regard. Backups
will be faster and less space will be used in the repository, but
restore performance is going to suffer. If the deltas are very small the
difference will probably be negligible, but as the deltas get large (and
especially if there are a lot of them) the penalty will be more noticeable.
I think an awful lot depends here on whether the repository is local
or remote. If you have filesystem access to wherever the backups are
stored anyway, I don't think that using pg_combinebackup to write out
a new data directory is going to be much slower than copying one data
directory from the repository to wherever you'd actually use the
backup. It may be somewhat slower because we do need to access some
data in every involved backup, but I don't think it should be vastly
slower because we don't have to read every backup in its entirety. For
each file, we read the (small) header of the newest incremental file
and every incremental file that precedes it until we find a full file.
Then, we construct a map of which blocks need to be read from which
sources and read only the required blocks from each source. If all the
blocks are coming from a single file (because there are no incrementals
for a certain file, or they contain no blocks) then we just copy the
entire source file in one shot, which can be optimized using the same
tricks we use elsewhere. Inevitably, this is going to read more data
and do more random I/O than just a flat copy of a directory, but it's
not terrible. The overall amount of I/O should be a lot closer to the
size of the output directory than to the sum of the sizes of the input
directories.
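To make the per-file logic concrete, here is a toy sketch of just the
source-selection step. This is not the code from the patch; every name
in it is invented, and it ignores file headers, truncation handling,
and actual I/O:

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS  8   /* blocks in the reconstructed file */
#define NSOURCES 3   /* sources[0] is newest; the last one is a full backup */

typedef struct
{
    const char *name;
    bool        is_full;            /* a full backup has every block */
    bool        has_block[NBLOCKS]; /* blocks an incremental carries */
} source;

int
main(void)
{
    source sources[NSOURCES] = {
        {"incr2", false, {false, true, false, false, true, false, false, false}},
        {"incr1", false, {true, false, false, true, false, false, false, false}},
        {"full",  true,  {false}},
    };
    int  source_map[NBLOCKS];

    /*
     * Each block comes from the newest source that contains it; a full
     * backup ends the search because it contains everything.
     */
    for (int blk = 0; blk < NBLOCKS; blk++)
    {
        for (int s = 0; s < NSOURCES; s++)
        {
            if (sources[s].is_full || sources[s].has_block[blk])
            {
                source_map[blk] = s;
                break;
            }
        }
    }

    for (int blk = 0; blk < NBLOCKS; blk++)
        printf("block %d <- %s\n", blk, sources[source_map[blk]].name);
    return 0;
}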
Now, if the repository is remote, and you have to download all of
those backups first, and then run pg_combinebackup on them afterward,
that is going to be unpleasant, unless the incremental backups are all
quite small. Possibly this could be addressed by teaching
pg_combinebackup to do things like accessing data over HTTP and SSH,
and relatedly, looking inside tarfiles without needing them unpacked.
For now, I've left those as ideas for future improvement, but I think
potentially they could address some of your concerns here. A
difficulty is that there are a lot of protocols that people might want
to use to push bytes around, and it might be hard to keep up with the
march of progress.
I do agree, though, that there's no such thing as a free lunch. I
wouldn't recommend to anyone that they plan to restore from a chain of
100 incremental backups. Not only might it be slow, but the
opportunities for something to go wrong are magnified. Even if you've
automated everything well enough that there's no human error involved,
what if you've got a corrupted file somewhere? Maybe that's not likely
in absolute terms, but the more files you've got, the more likely it
becomes. What I'd suggest someone do instead is periodically do
pg_combinebackup full_reference_backup oldest_incremental -o
new_full_reference_backup; rm -rf full_reference_backup; mv
new_full_reference_backup full_reference_backup. The new full
reference backup is intended to still be usable for restoring
incrementals based on the incremental it replaced. I hope that, if
people use the feature well, this should limit the need for really
long backup chains. I am sure, though, that some people will use it
poorly. Maybe there's room for more documentation on this topic.
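Spelled out as individual commands (the backup names here are just
placeholders), the rotation described above is:

pg_combinebackup full_reference_backup oldest_incremental -o new_full_reference_backup
rm -rf full_reference_backup
mv new_full_reference_backup full_reference_backup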
I was concerned with the difficulty of trying to stage the correct
backups for pg_combinebackup, not whether it would recognize that the
needed data was not available and then error appropriately. The latter
is surmountable within pg_combinebackup but the former is left up to the
user.
Indeed.
One note regarding the patches. I feel like
v5-0005-Prototype-patch-for-incremental-backup should be split to have
the WAL summarizer as one patch and the changes to base backup as a
separate patch.

It might not be useful to commit one without the other, but it would
make for an easier read. Just my 2c.
Yeah, maybe so. I'm not quite ready to commit to doing that split as
of this writing but I will think about it and possibly do it.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 04.10.23 22:08, Robert Haas wrote:
- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week. File removal is governed by mtimes, so if you change the
mtimes of your summary files or whack your system clock around, weird
things might happen. But obviously this might be inconvenient. Some
people might not want WAL summary files to be generated at all because
they don't care about incremental backup, and other people might want
them retained for longer, and still other people might want them to be
not removed automatically or removed automatically based on some
criteria other than mtime. I don't really know what's best here. I
don't think the default policy that the patches implement is
especially terrible, but it's just something that I made up and I
don't have any real confidence that it's wonderful.
The easiest answer is to have it off by default. Let people figure out
what works for them. There are various factors like storage, network,
server performance, RTO that will determine what combination of full
backup, incremental backup, and WAL replay will satisfy someone's
requirements. I suppose tests could be set up to determine this to some
degree. But we could also start slow and let people figure it out
themselves. When pg_basebackup was added, it was also disabled by default.
If we think that 7d is a good setting, then I would suggest considering something
like 10d. Otherwise, if you do a weekly incremental backup and you have
a time change or a hiccup of some kind one day, you lose your backup
sequence.
Another possible answer is, like, 400 days? Because why not? What is a
reasonable upper limit for this?
- It's regrettable that we don't have incremental JSON parsing; I
think that means anyone who has a backup manifest that is bigger than
1GB can't use this feature. However, that's also a problem for the
existing backup manifest feature, and as far as I can see, we have no
complaints about it. So maybe people just don't have databases with
enough relations for that to be much of a live issue yet. I'm inclined
to treat this as a non-blocker,
It looks like each file entry in the manifest takes about 150 bytes, so
1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now?
- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?
The current user experience of pg_basebackup is that it waits possibly a
long time for a checkpoint, and there is an option to make it go faster,
but there is no timeout AFAICT. Is this substantially different? Could
we just let it wait forever?
Also, does waiting for checkpoint and WAL summarization happen in
parallel? If so, what if it starts a checkpoint that might take 15 min
to complete, and then after 60 seconds it kicks you off because the WAL
summarization isn't ready. That might be wasteful.
- I'm curious what people think about the pg_walsummary tool that is
included in 0006. I think it's going to be fairly important for
debugging, but it does feel a little bit bad to add a new binary for
something pretty niche.
This seems fine.
Is the WAL summary file format documented anywhere in your patch set
yet? My only thought was, maybe the file format could be human-readable
(more like backup_label) to avoid this. But maybe not.
On Tue, Oct 24, 2023 at 10:53 AM Peter Eisentraut <peter@eisentraut.org> wrote:
The easiest answer is to have it off by default. Let people figure out
what works for them. There are various factors like storage, network,
server performance, RTO that will determine what combination of full
backup, incremental backup, and WAL replay will satisfy someone's
requirements. I suppose tests could be set up to determine this to some
degree. But we could also start slow and let people figure it out
themselves. When pg_basebackup was added, it was also disabled by default.

If we think that 7d is a good setting, then I would suggest considering something
like 10d. Otherwise, if you do a weekly incremental backup and you have
a time change or a hiccup of some kind one day, you lose your backup
sequence.

Another possible answer is, like, 400 days? Because why not? What is a
reasonable upper limit for this?
In concept, I don't think this should even be time-based. What you
should do is remove WAL summaries once you know that you've taken as
many incremental backups that might use them as you're ever going to
do. But PostgreSQL itself doesn't have any way of knowing what your
intended backup patterns are. If your incremental backup fails on
Monday night and you run it manually on Tuesday morning, you might
still rerun it as an incremental backup. If it fails every night for a
month and you finally realize and decide to intervene manually, maybe
you want a new full backup at that point. It's been a month. But on
the other hand maybe you don't. There's no time-based answer to this
question that is really correct, and I think it's quite possible that
your backup software might want to shut off time-based deletion
altogether and make its own decisions about when to nuke summaries.
However, I also don't think that's a great default setting. It could
easily lead to people wasting a bunch of disk space for no reason.
As far as the 7d value, I figured that nightly incremental backups
would be fairly common. If we think weekly incremental backups would
be common, then pushing this out to 10d would make sense. While
there's no reason you couldn't take an annual incremental backup, and
thus want a 400d value, it seems like a pretty niche use case.
Note that whether to remove summaries is a separate question from
whether to generate them in the first place. Right now, I have
wal_summarize_mb controlling whether they get generated in the first
place, but as I noted in another recent email, that isn't an entirely
satisfying solution.
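For anyone trying the patches out, the knobs under discussion look like
this in postgresql.conf. The names may well change, and the values below
are only examples (the syntax and units for the retention setting are my
guess, not something the patch documents yet):

wal_summarize_mb = 0            # turn summarization off, as some TAP tests do
wal_summarize_keep_time = '7d'  # hypothetical value; retention knob from 0003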
It looks like each file entry in the manifest takes about 150 bytes, so
1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now?
I suspect a few people have more files than that. They'll just have to
wait to use this feature until we get incremental JSON parsing (or
undo the decision to use JSON for the manifest).
The current user experience of pg_basebackup is that it waits possibly a
long time for a checkpoint, and there is an option to make it go faster,
but there is no timeout AFAICT. Is this substantially different? Could
we just let it wait forever?
We could. I installed the timeout because the first versions of the
feature were buggy, and I didn't like having my tests hang forever
with no indication of what had gone wrong. At least in my experience
so far, the time spent waiting for WAL summarization is typically
quite short, because only the WAL that needs to be summarized is
whatever was emitted since the last time it woke up, up through the
start LSN of the backup. That's probably not much, and the next time
the summarizer wakes up, the file should appear within moments. So,
it's a little different from the checkpoint case, where long waits are
expected.
Also, does waiting for checkpoint and WAL summarization happen in
parallel? If so, what if it starts a checkpoint that might take 15 min
to complete, and then after 60 seconds it kicks you off because the WAL
summarization isn't ready. That might be wasteful.
It is not parallel. The trouble is, we don't really have any way to
know whether WAL summarization is going to fail for whatever reason.
We don't expect that to happen, but if somebody changes the
permissions on the WAL summary directory or attaches gdb to the WAL
summarizer process or something of that sort, it might.
We could check at the outset whether we seem to be really far behind
and emit a warning. For instance, if we're 1TB behind on WAL
summarization when the checkpoint is requested, chances are something
is busted and we're probably not going to catch up any time soon. We
could warn the user about that and let them make their own decision
about whether to cancel. But, that idea won't help in unattended
operation, and the threshold for "really far behind" is not very
clear. It might be better to wait until we get more experience with
how things actually fail before doing too much engineering here, but
I'm also open to suggestions.
Is the WAL summary file format documented anywhere in your patch set
yet? My only thought was, maybe the file format could be human-readable
(more like backup_label) to avoid this. But maybe not.
The comment in blkreftable.c just above "#define BLOCKS_PER_CHUNK"
gives an overview of the format. I think that we probably don't want
to convert to a text format, because this format is extremely
space-efficient and very convenient to transfer between disk and
memory. We don't want to run out of memory when summarizing large
ranges of WAL, or when taking an incremental backup that requires
combining many individual summaries into a combined summary that tells
us what needs to be included in the backup.
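To give a flavor of the space-saving trick - and this is only the
concept, not the patch's actual layout or constants - each fixed-size
range ("chunk") of block numbers can be tracked as a short array of
2-byte offsets while few blocks are modified, and flipped to a bitmap
once the array would be no smaller than the bitmap:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_BLOCKS 65536              /* block numbers covered per chunk */
#define BITMAP_BYTES (CHUNK_BLOCKS / 8) /* size of the dense form */
#define MAX_OFFSETS  (BITMAP_BYTES / 2) /* sparse form kept while smaller */

typedef struct
{
    bool     is_bitmap;
    uint32_t noffsets;
    uint16_t offsets[MAX_OFFSETS];      /* sparse: offsets within the chunk */
    uint8_t  bitmap[BITMAP_BYTES];      /* dense: one bit per block */
} chunk;

static void
chunk_mark(chunk *c, uint16_t off)
{
    if (c->is_bitmap)
    {
        c->bitmap[off / 8] |= 1 << (off % 8);
        return;
    }

    /* Convert once the offset array would be no smaller than a bitmap. */
    if (c->noffsets == MAX_OFFSETS)
    {
        memset(c->bitmap, 0, sizeof(c->bitmap));
        for (uint32_t i = 0; i < c->noffsets; i++)
            c->bitmap[c->offsets[i] / 8] |= 1 << (c->offsets[i] % 8);
        c->is_bitmap = true;
        chunk_mark(c, off);
        return;
    }

    /* Duplicate offsets aren't deduplicated here, for brevity. */
    c->offsets[c->noffsets++] = off;
}

int
main(void)
{
    static chunk c;     /* static: zero-initialized, and it's ~16kB */

    chunk_mark(&c, 7);
    chunk_mark(&c, 4242);
    printf("form: %s, entries: %u\n",
           c.is_bitmap ? "bitmap" : "offset array", c.noffsets);
    return 0;
}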
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-10-24 Tu 12:08, Robert Haas wrote:
It looks like each file entry in the manifest takes about 150 bytes, so
1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now?

I suspect a few people have more files than that. They'll just have to
wait to use this feature until we get incremental JSON parsing (or
undo the decision to use JSON for the manifest).
Robert asked me to work on this quite some time ago, and most of this
work was done last year.
Here's my WIP for an incremental JSON parser. It works and passes all
the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
The reason I haven't posted it before is that it's about 50% slower in
pure parsing speed than the current recursive descent parser in my
testing. I've tried various things to make it faster, but haven't made
much impact. One of my colleagues is going to take a fresh look at it,
but maybe someone on the list can see where we can save some cycles.
If we can't make it faster, I guess we could use the RD parser for
non-incremental cases and only use the non-RD parser for incremental,
although that would be a bit sad. However, I don't think we can make the
RD parser suitable for incremental parsing - there's too much state
involved in the call stack.
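For anyone who hasn't read the patch, this toy fragment (emphatically
not the patch itself) shows why the explicit stack is what makes
incremental operation possible: the entire parse state lives in a
struct, so the parser can hand control back to the caller between input
chunks, which an RD parser keeping its state on the C call stack cannot
do. It only checks {}/[] nesting, to stay small:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct
{
    char stack[64];     /* explicit stack replaces the RD call stack */
    int  depth;
    bool failed;
} inc_parser;

static void
parse_chunk(inc_parser *p, const char *chunk, size_t len)
{
    for (size_t i = 0; i < len && !p->failed; i++)
    {
        char c = chunk[i];

        if (c == '{' || c == '[')
        {
            if (p->depth == (int) sizeof(p->stack))
                p->failed = true;   /* too deeply nested for this toy */
            else
                p->stack[p->depth++] = c;
        }
        else if (c == '}' || c == ']')
        {
            char expect = (c == '}') ? '{' : '[';

            if (p->depth == 0 || p->stack[--p->depth] != expect)
                p->failed = true;
        }
    }
}

int
main(void)
{
    inc_parser  p = {0};
    const char *chunks[] = {"{\"a\": [1, ", "2, {\"b\": 3}", "]}"};

    /* Feed the document in pieces, as a streaming reader would. */
    for (int i = 0; i < 3; i++)
        parse_chunk(&p, chunks[i], strlen(chunks[i]));

    puts(p.failed || p.depth != 0 ? "malformed" : "well-formed");
    return 0;
}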
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Robert asked me to work on this quite some time ago, and most of this
work was done last year.

Here's my WIP for an incremental JSON parser. It works and passes all
the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
The reason I haven't posted it before is that it's about 50% slower in
pure parsing speed than the current recursive descent parser in my
testing. I've tried various things to make it faster, but haven't made
much impact. One of my colleagues is going to take a fresh look at it,
but maybe someone on the list can see where we can save some cycles.

If we can't make it faster, I guess we could use the RD parser for
non-incremental cases and only use the non-RD parser for incremental,
although that would be a bit sad. However, I don't think we can make the
RD parser suitable for incremental parsing - there's too much state
involved in the call stack.
Yeah, this is exactly why I didn't want to use JSON for the backup
manifest in the first place. Parsing such a manifest incrementally is
complicated. If we'd gone with my original design where the manifest
consisted of a bunch of lines each of which could be parsed
separately, we'd already have incremental parsing and wouldn't be
faced with these difficult trade-offs.
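For illustration, a line-oriented manifest of that general sort might
have looked something like this - the exact format here is invented for
this example, not what was actually proposed:

File "base/5/16384" 8192 2023-10-25 12:00:00 GMT CRC32C a1b2c3d4
File "base/5/16385" 16384 2023-10-25 12:00:01 GMT CRC32C e5f6a7b8

Each line stands alone, so a reader never needs to hold more than one
line's worth of state, which is exactly what makes incremental parsing
trivial.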
Unfortunately, I'm not in a good position either to figure out how to
make your prototype faster, or to evaluate how painful it is to keep
both in the source tree. It's probably worth considering how likely it
is that we'd be interested in incremental JSON parsing in other cases.
Maintaining two JSON parsers is probably not a lot of fun regardless,
but if each of them gets used for a bunch of things, that feels less
bad than if one of them gets used for a bunch of things and the other
one only ever gets used for backup manifests. Would we be interested
in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
seems tenuous, but those are examples of the kind of thing that could
make us happy to have incremental JSON parsing as a general facility.
If nobody's very excited by those kinds of use cases, then this just
boils down to whether we want to (a) accept that users with very large
numbers of relation files won't be able to use pg_verifybackup or
incremental backup, (b) accept that we're going to maintain a second
JSON parser just to enable that use case and with no other benefit, or
(c) undertake to change the manifest format to something that is
straightforward to parse incrementally. I think (a) is reasonable
short term, but at some point I think we should do better. I'm not
really that enthused about (c) because it means more work for me and
possibly more arguing, but if (b) is going to cause a lot of hassle
then we might need to consider it.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-10-25 We 09:05, Robert Haas wrote:
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Robert asked me to work on this quite some time ago, and most of this
work was done last year.

Here's my WIP for an incremental JSON parser. It works and passes all
the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
The reason I haven't posted it before is that it's about 50% slower in
pure parsing speed than the current recursive descent parser in my
testing. I've tried various things to make it faster, but haven't made
much impact. One of my colleagues is going to take a fresh look at it,
but maybe someone on the list can see where we can save some cycles.

If we can't make it faster, I guess we could use the RD parser for
non-incremental cases and only use the non-RD parser for incremental,
although that would be a bit sad. However, I don't think we can make the
RD parser suitable for incremental parsing - there's too much state
involved in the call stack.

Yeah, this is exactly why I didn't want to use JSON for the backup
manifest in the first place. Parsing such a manifest incrementally is
complicated. If we'd gone with my original design where the manifest
consisted of a bunch of lines each of which could be parsed
separately, we'd already have incremental parsing and wouldn't be
faced with these difficult trade-offs.

Unfortunately, I'm not in a good position either to figure out how to
make your prototype faster, or to evaluate how painful it is to keep
both in the source tree. It's probably worth considering how likely it
is that we'd be interested in incremental JSON parsing in other cases.
Maintaining two JSON parsers is probably not a lot of fun regardless,
but if each of them gets used for a bunch of things, that feels less
bad than if one of them gets used for a bunch of things and the other
one only ever gets used for backup manifests. Would we be interested
in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
seems tenuous, but those are examples of the kind of thing that could
make us happy to have incremental JSON parsing as a general facility.

If nobody's very excited by those kinds of use cases, then this just
boils down to whether we want to (a) accept that users with very large
numbers of relation files won't be able to use pg_verifybackup or
incremental backup, (b) accept that we're going to maintain a second
JSON parser just to enable that use case and with no other benefit, or
(c) undertake to change the manifest format to something that is
straightforward to parse incrementally. I think (a) is reasonable
short term, but at some point I think we should do better. I'm not
really that enthused about (c) because it means more work for me and
possibly more arguing, but if (b) is going to cause a lot of hassle
then we might need to consider it.
I'm not too worried about the maintenance burden. The RD routines were
added in March 2013 (commit a570c98d7fa) and have hardly changed since
then. The new code is not ground-breaking - it's just a different (and
fairly well known) way of doing the same thing. I'd be happier if we
could make it faster, but maybe it's just a fact that keeping an
explicit stack, which is how this works, is slower.
I wouldn't at all be surprised if there were other good uses for
incremental JSON parsing, including some you've identified.
That said, I agree that JSON might not be the best format for backup
manifests, but maybe that ship has sailed.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan <andrew@dunslane.net> wrote:
I'm not too worried about the maintenance burden.
That said, I agree that JSON might not be the best format for backup
manifests, but maybe that ship has sailed.
I think it's a decision we could walk back if we had a good enough
reason, but it would be nicer if we didn't have to, because what we
have right now is working. If we change it for no real reason, we
might introduce new bugs, and at least in theory, incompatibility with
third-party tools that parse the existing format. If you think we can
live with the additional complexity in the JSON parsing stuff, I'd
rather go that way.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Oct 24, 2023 at 8:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
Yeah, maybe so. I'm not quite ready to commit to doing that split as
of this writing but I will think about it and possibly do it.
I have done this. Here's v7.
This version also includes several new TAP tests for the main patch,
some of which were inspired by our discussion. It also includes SGML
documentation for pg_walsummary.
New tests:
003_timeline.pl tests the case where the prior backup for an
incremental backup was taken on an earlier timeline.
004_manifest.pl tests the manifest-related options for pg_combinebackup.
005_integrity.pl tests the sanity checks that prevent combining a
backup with the wrong prior backup.
Overview of the new organization of the patch set:
0001 - preparatory refactoring of basebackup.c, changing the algorithm
that we use to decide which files have checksums
0002 - code movement only. makes it possible to reuse parse_manifest.c
0003 - add the WAL summarizer process, but useless on its own
0004 - add incremental backup, making use of 0003
0005 - add pg_walsummary debugging tool
Notes:
- I suspect that 0003 is the most likely to have serious bugs, followed by 0004.
- See XXX comments in the commit messages for some known open issues.
- Still looking for more comments on
/messages/by-id/CA+TgmoYdPS7a4eiqAFCZ8dr4r3-O0zq1LvTO5drwWr+7wHQaSQ@mail.gmail.com
and other recent emails where design questions came up
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v7-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
From ece784be59172049432385a5831005c2c9f8fed2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v7 2/5] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index 70884be00c..3c8effc533 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index ae05ac63cf..aa646f96a3 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index f0acd9f1e7..9895f2f17d 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v7-0001-Change-how-a-base-backup-decides-which-files-have.patch
From af44c310593481eb1d3227324fd585cbf27db50d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v7 1/5] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
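One note to make the 0001 behavior change easier to review: after this
patch, the decision to verify a file's checksums reduces to "did
parse_filename_for_nontemp_relation() recognize it as a relation file in
sendDir()?", and the segment number used to compute absolute block numbers
comes from that same parse instead of being re-derived from the file name
in sendFile(). For anyone who doesn't have the naming convention paged in,
here is a simplified stand-alone sketch of what that parse accepts --
<relfilenumber>, then optionally _<fork>, then optionally .<segno>. It is
only an illustration, not the server code; the real function is stricter,
e.g. it only accepts the known fork names:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static bool
parse_relation_filename(const char *name, unsigned *relfilenumber,
                        char fork[8], unsigned *segno)
{
    char       *endp;

    *segno = 0;
    strcpy(fork, "main");

    *relfilenumber = strtoul(name, &endp, 10);
    if (endp == name)
        return false;           /* must start with the relfilenumber */

    if (*endp == '_')
    {
        const char *forkname = endp + 1;
        size_t      len = strcspn(forkname, ".");

        /* the real code checks against "fsm", "vm", and "init" only */
        if (len == 0 || len >= 8)
            return false;
        memcpy(fork, forkname, len);
        fork[len] = '\0';
        endp = (char *) forkname + len;
    }

    if (*endp == '.')
    {
        *segno = strtoul(endp + 1, &endp, 10);
        if (*segno == 0)
            return false;       /* segment 0 never has a suffix */
    }

    return *endp == '\0';
}

int
main(void)
{
    unsigned    relfilenumber, segno;
    char        fork[8];

    if (parse_relation_filename("16384_fsm.2", &relfilenumber, fork, &segno))
        printf("relfilenumber=%u, fork=%s, segno=%u\n",
               relfilenumber, fork, segno);
    return 0;
}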
Attachment: v7-0005-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From a0ccde062e1af56dc7655388f2017fe644b6aeca Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v7 5/5] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 473 insertions(+)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..3a2122b067
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found within the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ truncated within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <application>pg_walsummary</application> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt = {0};	/* all options default to off */
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 064c0ecdc1..9045352b10 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4016,3 +4016,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.37.1 (Apple Git-137.1)
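To give a feel for pg_walsummary's output, here's a hypothetical session.
The file name, OIDs, and block numbers below are all invented, but each
line follows the printf formats in dump_one_relation(), so real output has
this shape; consecutive modified blocks are collapsed into ranges unless
-i/--individual is given:

pg_walsummary pg_wal/summaries/0000000100000000028000D8000000000287F438.summary
TS 1663, DB 5, REL 16384, FORK main: limit 0
TS 1663, DB 5, REL 16384, FORK main: blocks 0..16
TS 1663, DB 5, REL 16385, FORK fsm: block 2

With -q, nothing is printed at all and the exit status alone says whether
the file could be parsed, which is handy in scripts.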
Attachment: v7-0004-Add-support-for-incremental-backup.patch (application/octet-stream)
From e9ce1bd16037871339faa5f94e73ff37d4d7ee4f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v7 4/5] Add support for incremental backup.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Needs testing on a standby.
XXX. Should we send the whole backup manifest to the server or, say,
just an LSN?
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 334 ++++-
src/backend/backup/basebackup_incremental.c | 873 +++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/activity/pgstat_io.c | 4 +-
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1270 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 ++++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 153 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 89 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 123 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 12 +
51 files changed, 5581 insertions(+), 60 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest of an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files that contain
+ only the blocks that have changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to forgo incremental
+ backups altogether and take only full backups, which are simpler
+ to manage. For a large database all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 712568a62d..50536d0521 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 315e4b27cb..6cde31ee23 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
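
To make the naming convention above concrete: a relation segment that is
sent incrementally keeps its directory, but the base name gains an
INCREMENTAL. prefix. Here's a trivial standalone sketch of that
transformation (the paths are hypothetical, not taken from a real cluster):

/* Sketch only: how a segment path maps to its incremental counterpart,
 * per the snprintf() in sendDir() above. */
#include <stdio.h>

int
main(void)
{
	const char *dir = "base/16384";	/* hypothetical database directory */
	const char *seg = "16385.3";	/* hypothetical segment file name */
	char		ipath[1024];

	snprintf(ipath, sizeof(ipath), "%s/INCREMENTAL.%s", dir, seg);
	puts(ipath);	/* prints base/16384/INCREMENTAL.16385.3 */
	return 0;
}
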
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
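
Putting the header-writing code above together, the on-disk layout of an
incremental file is: a 4-byte magic number, a 4-byte count of included
blocks, a 4-byte truncation block length, an array of 4-byte block numbers,
and then the included blocks themselves, in the same order as the block
numbers. For anyone who wants to poke at the resulting files, here's a rough
sketch of a standalone header dumper; it assumes the same endianness as the
server that wrote the file and doesn't validate the magic number:

/* Sketch only: dump the header of an INCREMENTAL.* file. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	FILE	   *f;
	uint32_t	magic;
	uint32_t	nblocks;
	uint32_t	truncation_block_length;

	if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
	{
		fprintf(stderr, "usage: dump_incremental FILE\n");
		exit(1);
	}
	if (fread(&magic, sizeof(magic), 1, f) != 1 ||
		fread(&nblocks, sizeof(nblocks), 1, f) != 1 ||
		fread(&truncation_block_length, sizeof(truncation_block_length), 1, f) != 1)
	{
		fprintf(stderr, "could not read header\n");
		exit(1);
	}
	printf("magic 0x%08x, %u blocks, truncation block length %u\n",
		   magic, nblocks, truncation_block_length);
	/* Next come nblocks 4-byte block numbers, then nblocks * BLCKSZ of data. */
	fclose(f);
	return 0;
}
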
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..20cc00bded
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,873 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for the backup_file hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
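
As a sanity check on GetIncrementalFileSize(), it's worth spelling out the
arithmetic: with the default BLCKSZ of 8192 and 4-byte block numbers, a
segment with only a few modified blocks produces a tiny incremental file.
A sketch, assuming those default build settings:

/* Sketch: the GetIncrementalFileSize() arithmetic for n = 3 modified
 * blocks: 3 * 4 + (8192 + 4) * 3 = 24600 bytes, versus up to 1GB for
 * the full segment. */
#include <stdint.h>
#include <stdio.h>

#define ASSUMED_BLCKSZ 8192	/* default build */

int
main(void)
{
	unsigned	n = 3;
	size_t		sz;

	sz = 3 * sizeof(uint32_t) + (ASSUMED_BLCKSZ + sizeof(uint32_t)) * n;
	printf("%zu\n", sz);	/* prints 24600 */
	return 0;
}
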
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from the client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in COPY mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
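
From the client's perspective, UPLOAD_MANIFEST is just a COPY-in exchange on
the replication connection. Here's a condensed sketch of the flow using
libpq - it duplicates what the pg_basebackup changes below do, with
pg_fatal() from the frontend logging support standing in for real error
handling:

/* Condensed sketch of the client side of UPLOAD_MANIFEST; 'conn' is an
 * established replication connection and the manifest is already in
 * memory. PQclear() calls and retry logic are elided. */
#include "libpq-fe.h"
#include "common/logging.h"

static void
upload_manifest(PGconn *conn, const char *data, int len)
{
	PGresult   *res;

	if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
		pg_fatal("could not send UPLOAD_MANIFEST: %s",
				 PQerrorMessage(conn));
	res = PQgetResult(conn);
	if (PQresultStatus(res) != PGRES_COPY_IN)
		pg_fatal("server did not enter COPY mode: %s",
				 PQerrorMessage(conn));
	if (PQputCopyData(conn, data, len) < 0 ||
		PQputCopyEnd(conn, NULL) < 0)
		pg_fatal("could not send manifest data: %s", PQerrorMessage(conn));
	res = PQgetResult(conn);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		pg_fatal("UPLOAD_MANIFEST failed: %s", PQerrorMessage(conn));
	if (PQgetResult(conn) != NULL)
		pg_fatal("unexpected extra result after UPLOAD_MANIFEST");
}
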
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1a8cef345d..33416b11cf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
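+ /*
+ * Bits in "found" track which lines we have seen: 1 = START WAL LOCATION,
+ * 2 = START TIMELINE, 4 = INCREMENTAL FROM LSN, 8 = INCREMENTAL FROM TLI.
+ */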
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
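+
+ /*
+ * Temporarily NUL-terminate the data at e so that sscanf() cannot read
+ * beyond the end of the line.
+ */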
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
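+ /*
+ * Read and write the file in fairly large chunks to keep system-call
+ * overhead low.
+ */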
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", src);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
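+ /*
+ * Passing true for bFailIfExists means we fail rather than overwrite an
+ * existing destination file.
+ */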
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex characters, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
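+/*
+ * Declare a hash table keyed by each file's pathname; the matching
+ * definitions are generated in load_manifest.c via SH_DEFINE.
+ */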
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..039dce75ce
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1270 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
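+/* e.g. an incremental copy of relation file "16384" is named "INCREMENTAL.16384" */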
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
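+ /* Option values 1-3 identify long options with no single-letter form. */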
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", required_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("no input directories specified");
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version * 10000, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("new directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
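+
+ /*
+ * check_tli and check_lsn will hold the "INCREMENTAL FROM" values of the
+ * backup examined in the previous loop iteration (i.e. the next newer
+ * backup); each older backup must start where its successor says it
+ * should.
+ */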
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output=DIRECTORY output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -V, --version output version information, then exit\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ Oid oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strncpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (For example, if PG_VERSION contains "14\n", this
+ * function returns 140000.)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g., 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
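+ *
+ * If checksum_type is not CHECKSUM_TYPE_NONE, a checksum of the output file
+ * is returned via *checksum_length and *checksum_payload, either computed
+ * here or reused from a prior backup_manifest when possible.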
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
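+ /* Prevent computing a fresh checksum below; we just reused one. */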
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
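+ *
+ * The on-disk format read here is: a 4-byte magic number, a 4-byte block
+ * count, a 4-byte truncation block length, and then one BlockNumber per
+ * block, followed by the block data itself.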
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
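+ *
+ * Each block is either read from the source file given by sourcemap, at the
+ * offset given by offsetmap, or zero-filled if sourcemap has no entry for
+ * it.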
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * This is a new block that is not mentioned in the WAL summary. It
+ * should be an uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source, except if dry-run. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..d7f9e98b9c
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..73626f060c
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,89 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
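+# Row 3 ('crab') should be absent: it was inserted only on node1, after
+# node2 had already been created from backup2.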
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+ 'found expected rows on node3');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum-Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..744adb759e
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,123 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
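+ *
+ * Overall, the manifest we write here (see also finalize_manifest) has
+ * this shape:
+ *
+ * { "PostgreSQL-Backup-Manifest-Version": 1,
+ * "Files": [ ... ],
+ * "WAL-Ranges": [ ... ],
+ * "Manifest-Checksum": "..."}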
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
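+ *
+ * A typical entry, as assembled below, looks roughly like this (the values
+ * shown are hypothetical):
+ *
+ * { "Path": "base/1/1259", "Size": 8192,
+ * "Last-Modified": "2023-01-01 00:00:00 GMT",
+ * "Checksum-Algorithm": "CRC32C", "Checksum": "aa59f4df" }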
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
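+ *
+ * The caller is responsible for ensuring that dst has space for at least
+ * 2 * len bytes; no terminating NUL byte is written.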
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", WALSUMMARYDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
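
For reviewers, a condensed sketch of how the basebackup code is meant to
consume this API; the variable names are invented, the surrounding
bookkeeping is elided, and the buffer sizing just assumes a standard
relation segment:

    BlockNumber relative_block_numbers[RELSEG_SIZE];
    unsigned    num_blocks_required;
    unsigned    truncation_block_length;

    switch (GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
                                forknum, segno, size,
                                &num_blocks_required,
                                relative_block_numbers,
                                &truncation_block_length))
    {
        case BACK_UP_FILE_FULLY:
            /* send the whole file, as in a full backup */
            break;
        case BACK_UP_FILE_INCREMENTALLY:
            /* send an incremental file: header, truncation_block_length,
             * and only the listed blocks; its size is
             * GetIncrementalFileSize(num_blocks_required) */
            break;
        case DO_NOT_BACK_UP_FILE:
            /* skip the file entirely */
            break;
    }
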
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..ad11be4664 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..4f52ddbe79 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..5fe4faf1be 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7c913cbb93..064c0ecdc1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4004,3 +4004,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
v7-0003-Add-a-new-WAL-summarizer-process.patch
From 62ffb1206b0cb28a00de3a98507df0a384b6783e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v7 3/5] Add a new WAL summarizer process.
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter wal_summarize_mb enables or disables this new
background process, and also caps the amount of WAL covered by a single
summary file. However, a summary never spans more than one checkpoint
cycle, so in most practical cases the actual value of wal_summarize_mb is
unimportant, and it just acts as a flag to enable or disable
summarization.
XXX. Possibly we should turn this GUC into a Boolean or change how it
works somehow.
XXX. What should happen on a standby? Do we want summarization to
happen there just as it does on a primary, or do we want something
else?
The background process also automatically deletes summary files that
are older than wal_summarize_keep_time, if that parameter has a non-zero
value and the summarizer is configured to run.
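To spell out the limit-block rule with an example: if a relation fork is
truncated to 100 blocks partway through the summarized range and later
extended again, the summary records a limit block of 100, and blocks at or
above 100 appear in the modified-block list only if they were modified
again after the truncation. A consumer of a summary can therefore reason
like this (a sketch only, not code from the patch):

    if (limit_block == 0)
        /* relation created or destroyed within this WAL range */ ;
    else if (BlockNumberIsValid(limit_block))
        /* truncated to limit_block blocks at some point in the range;
         * higher-numbered blocks matter only if re-modified later */ ;
    else
        /* InvalidBlockNumber: no create/destroy/truncate in the range */ ;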
---
src/backend/access/transam/xlog.c | 100 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1363 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
27 files changed, 3645 insertions(+), 10 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40461923ea..9ddad7864f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3555,6 +3556,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3834,8 +3872,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter two do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3878,6 +3916,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5202,9 +5260,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6921,6 +6979,24 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably promptly:
+ * we've only just inserted and flushed the XLOG_CHECKPOINT_SHUTDOWN
+ * record. If this is not a shutdown checkpoint, then this might not be
+ * very prompt at all: the XLOG_CHECKPOINT_REDO record was written before
+ * we began flushing data to disk, and that could be many minutes ago at
+ * this point. However, we don't XLogFlush() after inserting that record,
+ * so we're not guaranteed that it's on disk until after the above call
+ * that flushes the XLOG_CHECKPOINT_ONLINE record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7595,6 +7671,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
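
As a usage sketch, the incremental backup code is expected to combine these
helpers roughly as follows, where tli, start_lsn, and end_lsn come from the
prior backup's manifest and the new backup's start location (the error
wording is illustrative, not final):

    List       *wslist;
    XLogRecPtr  missing_lsn;

    /* All summaries on this timeline overlapping the range of interest. */
    wslist = GetWalSummaries(tli, start_lsn, end_lsn);

    if (!WalSummariesAreComplete(wslist, start_lsn, end_lsn, &missing_lsn))
        ereport(ERROR,
                errmsg("WAL summaries are incomplete starting at %X/%X",
                       LSN_FORMAT_ARGS(missing_lsn)));
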
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cb624eab8..86f6cf2feb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1833,6 +1837,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2664,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3019,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3138,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3547,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3703,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3731,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3829,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4051,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5400,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5540,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..4ded951119
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1363 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at
+ * which the next summary file will start. Normally, these are the LSN and
+ * TLI at which the last file ended; in such case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered by a
+ * single summary file. If we read a WAL record that ends after the
+ * cutoff LSN computed here, we'll stop the summary. In most cases, it
+ * will actually stop earlier than that, but this is here as a
+ * backstop.
+ */
+ cutoff_lsn = current_lsn + wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %u @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && end_of_summary_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned LSN is
+ * necessarily the start of a WAL record and false if it's just the beginning
+ * of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %u", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
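+
+/*
+ * A hypothetical caller sketch (not part of this patch): wait up to ten
+ * seconds for summarization to reach some LSN of interest, and treat a
+ * shortfall in the return value as a timeout.
+ *
+ *     XLogRecPtr reached;
+ *
+ *     reached = WaitForWalSummarization(target_lsn, 10000);
+ *     if (reached < target_lsn)
+ *         ereport(ERROR,
+ *                 (errmsg("timed out waiting for WAL summarization")));
+ */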
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file,
+ * do so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like summarization
+ * to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
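+
+/*
+ * To illustrate the feedback loop above: starting from one quantum, three
+ * consecutive idle wakeups stretch the sleep to 2, 4, and then 8 quanta;
+ * if a burst then reads 7 pages before the next sleep, the sleep time
+ * drops straight back to a single quantum.
+ */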
+
+/*
+ * Remove old WAL summary files, if summary removal is enabled and the
+ * files in question are older than wal_summarize_keep_time.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the covered WAL no longer exists, we can remove the summary file,
+ * provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 182d666852..94e7944748 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4c58574166..faf42bdbfb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3181,6 +3184,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..4736606ac1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/common/Makefile b/src/common/Makefile
index 3c8effc533..2b41dd1839 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
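+
+/*
+ * To make the arithmetic above concrete: each chunk covers 2^16 = 65536
+ * blocks. An array-format chunk tops out at MAX_ENTRIES_PER_CHUNK =
+ * 65536 / 16 = 4096 two-byte offsets, i.e. 8192 bytes, while a
+ * bitmap-format chunk always needs 65536 bits, which is also 8192 bytes.
+ * So the switch to a bitmap happens exactly at the crossover point where
+ * the two representations are the same size; beyond it, the bitmap is
+ * strictly smaller.
+ */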
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status; /* status byte, used internally by simplehash */
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
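+
+/*
+ * Sketch of the on-disk layout produced by the writer routines below and
+ * consumed by the reader routines (derived from those routines, not a
+ * separate specification):
+ *
+ *     magic number (uint32)
+ *     for each relation fork, in sorted order:
+ *         BlockRefTableSerializedEntry
+ *         chunk usage array (nchunks x uint16, trailing zero entries trimmed)
+ *         data for each nonempty chunk (offset array or bitmap)
+ *     sentinel: an all-zeroes BlockRefTableSerializedEntry
+ *     CRC-32C of everything preceding it
+ */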
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ stop_offset = Min(stop_blkno - (chunkno * BLOCKS_PER_CHUNK),
+ BLOCKS_PER_CHUNK);
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
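+
+/*
+ * A hypothetical caller sketch (not part of this patch): fetch all modified
+ * blocks of an entry in batches of up to 256.
+ *
+ *     BlockNumber blocks[256];
+ *     BlockNumber next = 0;
+ *     int n;
+ *
+ *     while ((n = BlockRefTableEntryGetBlocks(entry, next,
+ *                                             InvalidBlockNumber,
+ *                                             blocks, lengthof(blocks))) > 0)
+ *     {
+ *         process(blocks, n);
+ *         next = blocks[n - 1] + 1;
+ *     }
+ */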
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
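+
+/*
+ * A minimal usage sketch for the reader API (hypothetical caller, not part
+ * of this patch), assuming a read_callback that pulls bytes from an open
+ * file:
+ *
+ *     BlockRefTableReader *reader;
+ *     RelFileLocator rlocator;
+ *     ForkNumber forknum;
+ *     BlockNumber limit_block;
+ *     BlockNumber blocks[256];
+ *
+ *     reader = CreateBlockRefTableReader(my_read_cb, &state, filename,
+ *                                        my_error_cb, NULL);
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *                                            &limit_block))
+ *     {
+ *         unsigned n;
+ *
+ *         while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
+ *                                                  lengthof(blocks))) > 0)
+ *             process_blocks(&rlocator, forknum, blocks, n);
+ *     }
+ *     DestroyBlockRefTableReader(reader);
+ */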
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
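+
+/*
+ * A minimal usage sketch for the incremental writer (hypothetical caller,
+ * not part of this patch), assuming entries arrive already sorted:
+ *
+ *     BlockRefTableWriter *writer;
+ *
+ *     writer = CreateBlockRefTableWriter(my_write_cb, &state);
+ *     while ((entry = next_entry_in_sorted_order()) != NULL)
+ *         BlockRefTableWriteEntry(writer, entry);
+ *     DestroyBlockRefTableWriter(writer);
+ */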
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
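+ *
+ * For example, if a fork is truncated to 1000 blocks (and no lower limit
+ * block was previously recorded), any remembered modifications of block
+ * 1000 or above are forgotten: chunk 0 keeps only offsets below 1000, and
+ * all higher-numbered chunks are emptied.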
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index aa646f96a3..6348d60ec4 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4ad572cb87..9d1e4ab57b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c92d0631a0..9717c4630e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12071,4 +12071,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7232b03e37..042fdc6ca1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 06b25617bc..7c913cbb93 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3993,3 +3993,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.37.1 (Apple Git-137.1)
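For concreteness, here is a minimal sketch of how the incremental writer
API declared in blkreftable.h above is meant to be driven. This is not
part of the patch: the stdio callback, the output file name, and the
OIDs are invented for illustration, and it assumes a frontend build
linked against the patched src/common.

#include <stdio.h>

#include "postgres_fe.h"
#include "common/blkreftable.h"

/* io_callback_fn wrapper around stdio, invented for this example. */
static int
write_to_file(void *callback_arg, void *data, int length)
{
	return (int) fwrite(data, 1, length, (FILE *) callback_arg);
}

int
main(void)
{
	FILE	   *fp = fopen("demo.summary", "wb");
	BlockRefTableWriter *writer;
	BlockRefTableEntry *entry;
	RelFileLocator rlocator = {.spcOid = 1663, .dbOid = 5, .relNumber = 16384};

	if (fp == NULL)
		return 1;

	writer = CreateBlockRefTableWriter(write_to_file, fp);

	/*
	 * Entries must be supplied sorted by tablespace, then database, then
	 * relfilenumber, then fork number, per the header comment.
	 */
	entry = CreateBlockRefTableEntry(rlocator, MAIN_FORKNUM);
	BlockRefTableEntrySetLimitBlock(entry, 0);	/* fork created in range */
	BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 7);
	BlockRefTableWriteEntry(writer, entry);
	BlockRefTableFreeEntry(entry);

	/* Writes the zeroed sentinel entry and the CRC, then flushes. */
	DestroyBlockRefTableWriter(writer);
	fclose(fp);
	return 0;
}

The read side is symmetrical: CreateBlockRefTableReader with a read
callback, then BlockRefTableReaderNextRelation and
BlockRefTableReaderGetBlocks until both are exhausted, then
DestroyBlockRefTableReader.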
On 2023-10-25 We 11:24, Robert Haas wrote:
On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan <andrew@dunslane.net> wrote:
I'm not too worried about the maintenance burden.
That said, I agree that JSON might not be the best format for backup
manifests, but maybe that ship has sailed.
I think it's a decision we could walk back if we had a good enough
reason, but it would be nicer if we didn't have to, because what we
have right now is working. If we change it for no real reason, we
might introduce new bugs, and at least in theory, incompatibility with
third-party tools that parse the existing format. If you think we can
live with the additional complexity in the JSON parsing stuff, I'd
rather go that way.
OK, I'll go with that. It will actually be a bit less invasive than the
patch I posted.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan <andrew@dunslane.net> wrote:
OK, I'll go with that. It will actually be a bit less invasive than the
patch I posted.
Why's that?
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-10-25 We 15:19, Robert Haas wrote:
On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan <andrew@dunslane.net> wrote:
OK, I'll go with that. It will actually be a bit less invasive than the
patch I posted.
Why's that?
Because we won't be removing the RD parser.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, Oct 26, 2023 at 6:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Because we won't be removing the RD parser.
Ah, OK.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Oct 24, 2023 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
Note that whether to remove summaries is a separate question from
whether to generate them in the first place. Right now, I have
wal_summarize_mb controlling whether they get generated in the first
place, but as I noted in another recent email, that isn't an entirely
satisfying solution.
I did some more research on this. My conclusion is that I should
remove wal_summarize_mb and just have a GUC summarize_wal = on|off
that controls whether the summarizer runs at all. There will be one
summary file per checkpoint, no matter how far apart checkpoints are
or how large the summary gets. Below I'll explain the reasoning; let
me know if you disagree.
What I describe above would be a bad plan if it were realistically
possible for a summary file to get so large that it might run the
machine out of memory either when producing it or when trying to make
use of it for an incremental backup. This seems to be a somewhat
difficult scenario to create. So far, I haven't been able to generate
WAL summary files more than a few tens of megabytes in size, even when
summarizing 50+ GB of WAL per summary file. One reason why it's hard
to produce large summary files is because, for a single relation fork,
the WAL summary size converges to 1 bit per modified block when the
number of modified blocks is large. This means that, even if you have
a terabyte sized relation, you're looking at no more than perhaps 20MB
of summary data no matter how much of it gets modified. Now, somebody
could have a 30TB relation and then if they modify the whole thing
they could have the better part of a gigabyte of summary data for that
relation, but if you've got a 30TB table you probably have enough
memory that that's no big deal.
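As a quick back-of-the-envelope check of those estimates, here is a
standalone snippet assuming the default 8kB block size; it is not from
the patch:

#include <stdio.h>

int
main(void)
{
	const double blcksz = 8192;	/* default BLCKSZ */
	const double tb = 1024.0 * 1024 * 1024 * 1024;
	const double sizes_tb[] = {1, 30};

	for (int i = 0; i < 2; i++)
	{
		double		nblocks = sizes_tb[i] * tb / blcksz;

		/* At the bitmap limit, each tracked block costs one bit. */
		printf("%2.0fTB relation: %.0f blocks, bitmap ~%.0f MB\n",
			   sizes_tb[i], nblocks, nblocks / 8 / (1024 * 1024));
	}
	return 0;
}

That prints about 16MB for a 1TB relation and about 480MB for a 30TB
relation, consistent with the figures above.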
But, what if you have multiple relations? I initialized pgbench with a
scale factor of 30000 and also with 30000 partitions and did a 1-hour
run. I got 4 checkpoints during that time and each one produced an
approximately 16MB summary file. The efficiency here drops
considerably. For example, one of the files is 16495398 bytes and
records information on 7498403 modified blocks, which works out to
about 2.2 bytes per modified block. That's more than an order of
magnitude worse than what I got in the single-relation case, where the
summary file didn't even use two *bits* per modified block. But here
again, the file just isn't that big in absolute terms. To get a 1GB+
WAL summary file, you'd need to modify millions of relation forks,
maybe tens of millions, and most installations aren't even going to
have that many relation forks, let alone be modifying them all
frequently.
My conclusion here is that it's pretty hard to have a database where
WAL summarization is going to use too much memory. I wouldn't be
terribly surprised if there are some extreme cases where it happens,
but those databases probably aren't great candidates for incremental
backup anyway. They're probably databases with millions of relations
and frequent, widely-scattered modifications to those relations. And
if you have that kind of high turnover rate then incremental backup
isn't going to be as helpful anyway, so there's probably no reason to
enable WAL summarization in the first place. Maybe if you have that
plus in the same database cluster you have 100TB of completely
static data that is never modified, and if you also do all of this on
a pretty small machine, then you can find a case where incremental
backup would have worked well but for the memory consumed by WAL
summarization.
But I think that's sufficiently niche that the current patch shouldn't
concern itself with such cases. If we find that they're common enough
to worry about, we might eventually want to do something to mitigate
them, but whether that thing looks anything like wal_summarize_mb
seems pretty unclear. So I conclude that it's a mistake to include
that GUC as currently designed and propose to replace it with a
Boolean as described above.
Comments?
--
Robert Haas
EDB: http://www.enterprisedb.com
While reviewing this thread today, I realized that I never responded
to this email. That was inadvertent; my apologies.
On Wed, Jun 14, 2023 at 4:34 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Nice, I like this idea.
Cool.
Skimming through the 7th patch, I see claims that FSM is not fully
WAL-logged and thus shouldn't be tracked, and so it indeed doesn't
track those changes.
I disagree with that decision: we now have support for custom resource
managers, which may use the various forks for other purposes than
those used in PostgreSQL right now. It would be a shame if data is
lost because of the backup tool ignoring forks because the PostgreSQL
project itself doesn't have post-recovery consistency guarantees in
that fork. So, unless we document that WAL-logged changes in the FSM
fork are actually not recoverable from backup, regardless of the type
of contents, we should still keep track of the changes in the FSM fork
and include the fork in our backups or only exclude those FSM updates
that we know are safe to ignore.
I'm not sure what to do about this problem. I don't think any data
would be *lost* in the scenario that you mention; what I think would
happen is that the FSM forks would be backed up in their entirety even
if they were owned by some other table AM or index AM that was
WAL-logging all changes to whatever it was storing in that fork. So I
think that there is not a correctness issue here but rather an
efficiency issue.
It would still be nice to fix that somehow, but I don't see how to do
it. It would be easy to make the WAL summarizer stop treating the FSM
as a special case, but there's no way for basebackup_incremental.c to
know whether a particular relation fork is for the heap AM or some
other AM that handles WAL-logging differently. It can't for example
examine pg_class; it's not connected to any database, let alone every
database. So we have to either trust that the WAL for the FSM is
correct and complete in all cases, or assume that it isn't in any
case. And the former doesn't seem like a safe or wise assumption given
how the heap AM works.
I think the reality here is unfortunately that we're missing a lot of
important infrastructure to really enable a multi-table-AM world. The
heap AM, and every other table AM, should include a metapage so we can
tell what we're looking at just by examining the disk files. Relation
forks don't scale and should be replaced with some better system that
does. We should have at least two table AMs in core that are fully
supported and do truly useful things. Until some of that stuff (and
probably a bunch of other things) get sorted out, out-of-core AMs are
going to have to remain second-class citizens to some degree.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
If that is still an area open for discussion: wouldn't it be better to
just specify LSN as it would allow resyncing standby across major lag
where the WAL to replay would be enormous? Given that we had
primary->standby where standby would be stuck on some LSN, right now
it would be:
1) calculate backup manifest of desynced 10TB standby (how? using
which tool?) - even if possible, that means reading 10TB of data
instead of just putting a number, isn't it?
2) backup primary with such incremental backup >= LSN
3) copy the incremental backup to standby
4) apply it to the impaired standby
5) restart the WAL replay
As you may be able to tell from the flurry of posts and new patch
sets, I'm trying hard to sort out the remaining open items that
pertain to this patch set, and I'm now back to thinking about this
one.
TL;DR: I think the idea has some potential, but there are some
pitfalls that I'm not sure how to address.
I spent some time looking at how we currently use the data from the
backup manifest. Currently, we do two things with it. First, when
we're backing up each file, we check whether it's present in the
backup manifest and, if not, we back it up in full. This actually
feels fairly poor. If it makes any difference at all, then presumably
the underlying algorithm is buggy and needs to be fixed. Maybe that
should be ripped out altogether or turned into some kind of sanity
check that causes a big explosion if it fails. Second, we check
whether the WAL ranges reported by the client match up with the
timeline history of the server (see PrepareForIncrementalBackup). This
set of sanity checks seems fairly important to me, and I'd regret
discarding them. I think there's some possibility that they might
catch user error, like where somebody promotes multiple standbys and
maybe they even get the same timeline on more than one of them, and
then confusion might ensue. I also think that there's a real
possibility that they might make it easier to track down bugs in my
code, even if those bugs aren't necessarily timeline-related. If (or
more realistically when) somebody ends up with a corrupted cluster
after running pg_combinebackup, we're going to need to figure out
whether that corruption is the result of bugs (and if so where they
are) or user error (and if so what it was). The most obvious ways of
ending up with a corrupted cluster are (1) taking an incremental
backup against a prior backup that is not in the history of the server
from which the backup is taken or (2) combining an incremental backup
with the wrong prior backup, so whatever sanity checks we can have that
will tend to prevent those kinds of mistakes seem like a really good
idea.
And those kinds of checks seem relevant here, too. Consider that it
wouldn't be valid to use pg_combinebackup to fast-forward a standby
server if the incremental backup's backup-end-LSN preceded the standby
server's minimum recovery point. Imagine that you have a standby whose
last checkpoint's redo location was at LSN 2/48. Being the
enterprising DBA that you are, you make a note of that LSN and go take
an incremental backup based on it. You then stop the standby server
and try to apply the incremental backup to fast-forward the standby.
Well, it's possible that in the meanwhile the standby actually caught
up, and now has a minimum recovery point that follows the
backup-end-LSN of your incremental backup. In that case, you can't
legally use that incremental backup to fast-forward that standby, but
no code I've yet written would be smart enough to figure that out. Or,
maybe you (or some other DBA on your team) got really excited and
actually promoted that standby meanwhile, and now it's not even on the
same timeline any more. In the "normal" case where you take an
incremental backup based on an earlier base backup, these kinds of
problems are detectable, and it seems to me that if we want to enable
this kind of use case, it would be pretty smart to have a plan to
detect similar mistakes here. I don't, currently, but maybe there is
one.
Another practical problem here is that, right now, pg_combinebackup
doesn't have an in-place mode. It knows how to take a bunch of input
backups and write out an output backup, but that output backup needs
to go into a new, fresh directory (or directories plural, if there are
user-defined tablespaces). I had previously considered adding such a
mode, but the idea I had at the time wouldn't have worked for this
case. I imagined that someone might want to run "pg_combinebackup
--in-place full incr" and clobber the contents of the incr directory
with the output, basically discarding the incremental backup you took
in favor of a full backup that could have been taken at the same point
in time. But here, you'd want to clobber the *first* input to
pg_combinebackup, not the last one, so if we want to add something
like this, the UI needs some thought.
One thing that I find quite scary about such a mode is that if you
crash mid-way through, you're in a lot of trouble. In the case that I
had previous contemplated -- overwrite the last incremental with the
reconstructed full backup -- you *might* be able to make it crash safe
by writing out the full files for each incremental file, fsyncing
everything, then removing all of the incremental files and fsyncing
again. The idea would be that if you crash midway through it's OK to
just repeat whatever you were trying to do before the crash and if it
succeeds the second time then all is well. If, for a given file, there
are both incremental and non-incremental versions, then the second
attempt should remove and recreate the non-incremental version from
the incremental version. If there's only a non-incremental version, it
could be that the previous attempt got far enough to remove the
incremental file, but in that case the full file that we now have
should be the same thing that we would produce if we did the operation
now. It all sounds a little scary, but maybe it's OK. And as long as
you don't remove the this-is-an-incremental-backup markers from the
backup_label file until you've done everything else, you can tell
whether you've ever successfully completed the reassembly or not. But
if you're using a hypothetical overwrite mode to overwrite the first
input rather than the last one, well, it looks like a valid data
directory already, and if you replace a bunch of files and then crash,
it still looks like one, but it really isn't valid any more. I'm not sure I've really
wrapped my head around all of the cases here, but it does feel like
there are some new ways to go wrong.
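To make that ordering concrete, here is a hypothetical standalone sketch
of the overwrite-the-incremental flow. The file names are invented, and
this is not pg_combinebackup code:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *full = "16384";	/* reconstructed full file */
	const char *incr = "INCREMENTAL.16384";	/* incremental input */
	const char	data[] = "reconstructed contents";
	int			fd;
	int			dirfd;

	/* Step 1: write out the full version and make it durable. */
	fd = open(full, O_CREAT | O_WRONLY | O_TRUNC, 0600);
	if (fd < 0 ||
		write(fd, data, strlen(data)) != (ssize_t) strlen(data) ||
		fsync(fd) != 0 || close(fd) != 0)
	{
		perror(full);
		return 1;
	}

	/*
	 * Step 2: only then remove the incremental file. If we crash before
	 * this point, both files still exist and the whole operation can
	 * simply be repeated.
	 */
	if (unlink(incr) != 0 && errno != ENOENT)
	{
		perror(incr);
		return 1;
	}

	/* Step 3: fsync the directory so the unlink is durable as well. */
	dirfd = open(".", O_RDONLY);
	if (dirfd < 0 || fsync(dirfd) != 0 || close(dirfd) != 0)
	{
		perror(".");
		return 1;
	}
	return 0;
}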
One thing I also realized when thinking about this is that you could
probably hose yourself with the patch set as it stands today by taking
a full backup, downgrading to wal_level=minimal for a while, doing
some WAL-skipping operations, upgrading to a higher WAL-level again,
and then taking an incremental backup. I think the solution to that is
probably for the WAL summarizer to refuse to run if wal_level=minimal.
Then there would be a gap in the summary files which an incremental
backup attempt would detect.
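A standalone sketch of that rule follows; the names here are invented
for illustration, and this is not the patch's code:

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the relevant ordering of wal_level values for this demo. */
typedef enum
{
	DEMO_WAL_LEVEL_MINIMAL,
	DEMO_WAL_LEVEL_REPLICA,
	DEMO_WAL_LEVEL_LOGICAL
} DemoWalLevel;

/*
 * Under wal_level=minimal some relation data is deliberately not
 * WAL-logged, so a summary built from such WAL would be silently
 * incomplete. Refusing to summarize leaves a detectable gap instead.
 */
static bool
may_summarize(DemoWalLevel level)
{
	return level >= DEMO_WAL_LEVEL_REPLICA;
}

int
main(void)
{
	printf("minimal: %s\n",
		   may_summarize(DEMO_WAL_LEVEL_MINIMAL) ? "summarize" : "refuse");
	printf("replica: %s\n",
		   may_summarize(DEMO_WAL_LEVEL_REPLICA) ? "summarize" : "refuse");
	return 0;
}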
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2023-10-30 10:45:03 -0400, Robert Haas wrote:
On Tue, Oct 24, 2023 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
Note that whether to remove summaries is a separate question from
whether to generate them in the first place. Right now, I have
wal_summarize_mb controlling whether they get generated in the first
place, but as I noted in another recent email, that isn't an entirely
satisfying solution.
I did some more research on this. My conclusion is that I should
remove wal_summarize_mb and just have a GUC summarize_wal = on|off
that controls whether the summarizer runs at all. There will be one
summary file per checkpoint, no matter how far apart checkpoints are
or how large the summary gets. Below I'll explain the reasoning; let
me know if you disagree.
What I describe above would be a bad plan if it were realistically
possible for a summary file to get so large that it might run the
machine out of memory either when producing it or when trying to make
use of it for an incremental backup. This seems to be a somewhat
difficult scenario to create. So far, I haven't been able to generate
WAL summary files more than a few tens of megabytes in size, even when
summarizing 50+ GB of WAL per summary file. One reason why it's hard
to produce large summary files is because, for a single relation fork,
the WAL summary size converges to 1 bit per modified block when the
number of modified blocks is large. This means that, even if you have
a terabyte sized relation, you're looking at no more than perhaps 20MB
of summary data no matter how much of it gets modified. Now, somebody
could have a 30TB relation and then if they modify the whole thing
they could have the better part of a gigabyte of summary data for that
relation, but if you've got a 30TB table you probably have enough
memory that that's no big deal.
I'm not particularly worried about the rewriting-30TB-table case - that'd also
generate >= 30TB of WAL most of the time. Which realistically is going to
trigger a few checkpoints, even on very big instances.
But, what if you have multiple relations? I initialized pgbench with a
scale factor of 30000 and also with 30000 partitions and did a 1-hour
run. I got 4 checkpoints during that time and each one produced an
approximately 16MB summary file.
Hm, I assume the pgbench run will be fairly massively bottlenecked on IO, due
to having to read data from disk, lots of full page writes and having to write
out lots of data? I.e. we won't do all that many transactions during the 1h?
To get a 1GB+ WAL summary file, you'd need to modify millions of relation
forks, maybe tens of millions, and most installations aren't even going to
have that many relation forks, let alone be modifying them all frequently.
I tried to find bad cases for a bit - and I am not worried. I wrote a pgbench
script to create 10k single-row relations in each script, ran that with 96
clients, checkpointed, and ran a pgbench script that updated the single row in
each table.
After creation of the relation WAL summarizer uses
LOG: level: 1; Wal Summarizer: 378433680 total in 43 blocks; 5628936 free (66 chunks); 372804744 used
and creates a 26MB summary file.
After checkpoint & updates WAL summarizer uses:
LOG: level: 1; Wal Summarizer: 369205392 total in 43 blocks; 5864536 free (26 chunks); 363340856 used
and creates a 26MB summary file.
Sure, 350MB ain't nothing, but simply just executing \dt in the database
created by this makes the backend use 260MB after. Which isn't going away,
whereas WAL summarizer drops its memory usage soon after.
But I think that's sufficiently niche that the current patch shouldn't
concern itself with such cases. If we find that they're common enough
to worry about, we might eventually want to do something to mitigate
them, but whether that thing looks anything like wal_summarize_mb
seems pretty unclear. So I conclude that it's a mistake to include
that GUC as currently designed and propose to replace it with a
Boolean as described above.
After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.
I wonder if there are use cases that might like to consume WAL summaries
before the next checkpoint? For those wal_summarize_mb likely wouldn't be a
good control, but they might want to request a summary file to be created at
some point?
Greetings,
Andres Freund
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.
Cool! Thanks for testing.
I wonder if there are use cases that might like to consume WAL summaries
before the next checkpoint? For those wal_summarize_mb likely wouldn't be a
good control, but they might want to request a summary file to be created at
some point?
It's possible. I actually think it's even more likely that there are
use cases that will also want the WAL summarized, but in some
different way. For example, you might want a summary that would give
you the LSN or approximate LSN where changes to a certain block
occurred. Such a summary would be way bigger than these summaries and
therefore, at least IMHO, a lot less useful for incremental backup,
but it could be really useful for something else. Or you might want
summaries that focus on something other than which blocks got changed,
like what relations were created or destroyed, or only changes to
certain kinds of relations or relation forks, or whatever. In a way,
you can even think of logical decoding as a kind of WAL summarization,
just with a very different set of goals from this one. I won't be too
surprised if the next hacker wants something that is different enough
from what this does that it doesn't make sense to share mechanism, but
if by chance they want the same thing but dumped a bit more
frequently, well, that can be done.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Oct 30, 2023 at 6:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
If that is still an area open for discussion: wouldn't it be better to
just specify LSN as it would allow resyncing standby across major lag
where the WAL to replay would be enormous? Given that we had
primary->standby where standby would be stuck on some LSN, right now
it would be:
1) calculate backup manifest of desynced 10TB standby (how? using
which tool?) - even if possible, that means reading 10TB of data
instead of just putting a number, isn't it?
2) backup primary with such incremental backup >= LSN
3) copy the incremental backup to standby
4) apply it to the impaired standby
5) restart the WAL replay
As you may be able to tell from the flurry of posts and new patch
sets, I'm trying hard to sort out the remaining open items that
pertain to this patch set, and I'm now back to thinking about this
one.TL;DR: I think the idea has some potential, but there are some
pitfalls that I'm not sure how to address.I spent some time looking at how we currently use the data from the
backup manifest. Currently, we do two things with it. First, when
we're backing up each file, we check whether it's present in the
backup manifest and, if not, we back it up in full. This actually
feels fairly poor. If it makes any difference at all, then presumably
the underlying algorithm is buggy and needs to be fixed. Maybe that
should be ripped out altogether or turned into some kind of sanity
check that causes a big explosion if it fails. Second, we check
whether the WAL ranges reported by the client match up with the
timeline history of the server (see PrepareForIncrementalBackup). This
set of sanity checks seems fairly important to me, and I'd regret
discarding them. I think there's some possibility that they might
catch user error, like where somebody promotes multiple standbys and
maybe they even get the same timeline on more than one of them, and
then confusion might ensue.
[..]
Another practical problem here is that, right now, pg_combinebackup
doesn't have an in-place mode. It knows how to take a bunch of input
backups and write out an output backup, but that output backup needs
to go into a new, fresh directory (or directories plural, if there are
user-defined tablespaces). I had previously considered adding such a
mode, but the idea I had at the time wouldn't have worked for this
case. I imagined that someone might want to run "pg_combinebackup
--in-place full incr" and clobber the contents of the incr directory
with the output, basically discarding the incremental backup you took
in favor of a full backup that could have been taken at the same point
in time.
[..]
Thanks for answering! It all sounds like this
resync-standby-using-primary-incrbackup idea isn't fit for the current
pg_combinebackup, but rather for a new tool, hopefully in the future. It
could take the current LSN from the stuck standby, calculate a manifest on
the lagged and offline standby (do we need to calculate the manifest
Checksum in that case? I cannot find code for it), deliver it via
"UPLOAD_MANIFEST" to primary and start fetching and applying the
differences while doing some form of copy-on-write from old & incoming
incrbackup data to "$relfilenodeid.new" and then durable_unlink() old
one and durable_rename("$relfilenodeid.new", "$relfilenodeid"). Would
it still be possible in theory? (it could use additional safeguards
like renaming the controlfile when starting and just before ending to
additionally block startup if it hasn't finished). Also it looks, as
per the comment near struct IncrementalBackupInfo.manifest_files, that
even checksums are just more for safeguarding rather than core
implementation (?)
What I meant in the initial idea is not to hinder current efforts,
but to ask whether the current design would stand in the way of such a
cool new addition in the future?
One thing I also realized when thinking about this is that you could
probably hose yourself with the patch set as it stands today by taking
a full backup, downgrading to wal_level=minimal for a while, doing
some WAL-skipping operations, upgrading to a higher WAL-level again,
and then taking an incremental backup. I think the solution to that is
probably for the WAL summarizer to refuse to run if wal_level=minimal.
Then there would be a gap in the summary files which an incremental
backup attempt would detect.
As per earlier test [1], I've already tried to simulate that in
incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh, but that
worked (but that was with the CTAS wal_level=minimal optimization -> a new
relfilenodeOID is used for CTAS, which got included in the incremental
backup as it's a new file). Even retested that with your v7 patch with
asserts, same. When simulating with "BEGIN; TRUNCATE nightmare; COPY
nightmare FROM '/tmp/copy.out'; COMMIT;" on wal_level=minimal it still
recovers using incremental backup because the WAL contains:
rmgr: Storage, desc: CREATE base/5/36425
[..]
rmgr: XLOG, desc: FPI , blkref #0: rel 1663/5/36425 blk 0 FPW
[..]
e.g. TRUNCATE sets a new relfilenode each time, so those relations will
always be included in the backup, and the wal_level=minimal
optimizations kick in only for commands that issue a new relfilenode.
True/false?
postgres=# select oid, relfilenode, relname from pg_class where
relname like 'night%' order by 1;
oid | relfilenode | relname
-------+-------------+---------------------
16384 | 0 | nightmare
16390 | 36420 | nightmare_p0
16398 | 36425 | nightmare_p1
36411 | 0 | nightmare_pkey
36413 | 36422 | nightmare_p0_pkey
36415 | 36427 | nightmare_p1_pkey
36417 | 0 | nightmare_brin_idx
36418 | 36423 | nightmare_p0_ts_idx
36419 | 36428 | nightmare_p1_ts_idx
(9 rows)
postgres=# truncate nightmare;
TRUNCATE TABLE
postgres=# select oid, relfilenode, relname from pg_class where
relname like 'night%' order by 1;
oid | relfilenode | relname
-------+-------------+---------------------
16384 | 0 | nightmare
16390 | 36434 | nightmare_p0
16398 | 36439 | nightmare_p1
36411 | 0 | nightmare_pkey
36413 | 36436 | nightmare_p0_pkey
36415 | 36441 | nightmare_p1_pkey
36417 | 0 | nightmare_brin_idx
36418 | 36437 | nightmare_p0_ts_idx
36419 | 36442 | nightmare_p1_ts_idx
-J.
[1]: /messages/by-id/CAKZiRmzT+bX2ZYdORO32cADtfQ9DvyaOE8fsOEWZc2V5FkEWVg@mail.gmail.com
On Wed, Nov 1, 2023 at 8:57 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Thanks for answering! It all sounds like this
resync-standby-using-primary-incrbackup idea isn't fit for the current
pg_combinebackup, but rather for a new tool, hopefully in the future. It
could take the current LSN from the stuck standby, calculate a manifest on
the lagged and offline standby (do we need to calculate the manifest
Checksum in that case? I cannot find code for it), deliver it via
"UPLOAD_MANIFEST" to primary and start fetching and applying the
differences while doing some form of copy-on-write from old & incoming
incrbackup data to "$relfilenodeid.new" and then durable_unlink() old
one and durable_rename("$relfilenodeid.new", "$relfilenodeid"). Would
it still be possible in theory? (it could use additional safeguards
like renaming the controlfile when starting and just before ending to
additionally block startup if it hasn't finished). Also it looks, as
per the comment near struct IncrementalBackupInfo.manifest_files, that
even checksums are just more for safeguarding rather than core
implementation (?)
What I meant in the initial idea is not to hinder current efforts,
but to ask whether the current design would stand in the way of such a
cool new addition in the future?
Hmm, interesting idea. I think something like that could be made to
work. My first thought was that it would sort of suck to have to
compute a manifest as a precondition of doing this, but then I started
to think maybe it wouldn't, really. I mean, you'd have to scan the
local directory tree and collect all the filenames so that you could
remove any files that are no longer present in the current version of
the data directory which the incremental backup would send to you. If
you're already doing that, the additional cost of generating a
manifest isn't that high, at least if you don't include checksums,
which aren't required. On the other hand, if you didn't need to send
the server a manifest and just needed to send the required WAL ranges,
that would be even cheaper. I'll spend some more time thinking about
this next week.
As per earlier test [1], I've already tried to simulate that in
incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh, but that
worked (but that was with the CTAS wal_level=minimal optimization -> a new
relfilenodeOID is used for CTAS, which got included in the incremental
backup as it's a new file). Even retested that with your v7 patch with
asserts, same. When simulating with "BEGIN; TRUNCATE nightmare; COPY
nightmare FROM '/tmp/copy.out'; COMMIT;" on wal_level=minimal it still
recovers using incremental backup because the WAL contains:
TRUNCATE itself is always WAL-logged, but data added to the relation
in the same relation as the TRUNCATE isn't always WAL-logged (but
sometimes it is, depending on the relation size). So the failure case
wouldn't be missing the TRUNCATE but missing some data-containing
blocks within the relation shortly after it was created or truncated.
I think what I need to do here is avoid summarizing WAL that was
generated under wal_level=minimal. The walsummarizer process should
just refuse to emit summaries for any such WAL.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.
Here's v8. Changes:
- Replace wal_summarize_mb GUC with summarize_wal = on | off.
- Document the summarize_wal and wal_summary_keep_time GUCs.
- Refuse to start with summarize_wal = on and wal_level = minimal.
- Increase default wal_summary_keep_time to 10d from 7d, per (what I
think was) a suggestion from Peter E.
- Fix fencepost errors when deciding which WAL summaries are needed
for a backup.
- Fix indentation damage.
- Standardize on ereport(DEBUG1, ...) in walsummarizer.c vs. various
more and less chatty things I had before.
- Include the timeline in some error messages because not having it
proved confusing.
- Be more consistent about ignoring the FSM fork.
- Fix a bug that could cause WAL summarization to error out when
switching timelines.
- Fix the division between the wal summarizer and incremental backup
patches so that the former passes tests without the latter.
- Fix some things that an older compiler didn't like, including adding
pg_attribute_printf in some places.
- Die with an error instead of crashing if someone feeds us a manifest
with no WAL ranges.
- Sort the block numbers that need to be read from a relation file
before reading them, so that we're certain to read them in ascending
order.
- Be more careful about computing the truncation_block_length of an
incremental file; don't do math on a block number that might be
InvalidBlockNumber.
- Fix pg_combinebackup so it doesn't fail when zero-filled blocks are
added to a relation between the prior backup and the incremental
backup.
- Improve the pg_combinebackup -d output so that it explains in detail
how it's carrying out reconstruction, to improve debuggability.
- Disable WAL summarization by default, but add a test patch to the
series to enable it, because running the whole test suite with it
turned on is good for bug-hunting.
- In pg_walsummary, zero a struct before using instead of starting
with arbitrary junk values.
To do list:
- Figure out whether to do something other than uploading the whole
summary, per discussion with Jakub Wartak.
- Decide what to do about the 60-second waiting-for-WAL-summarization timeout.
- Make incremental backup fail quickly if WAL summarization is not even enabled.
- Have pg_basebackup error out nicely if an incremental backup is
requested from an older server that can't do that.
- Add some kind of tests for pg_walsummary.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v8-0005-Add-new-pg_walsummary-tool.patch
From dfa9822958d39ef997c1e071239a2bdfac347bd1 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v8 5/6] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 ++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 475 insertions(+)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..3a2122b067
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found in the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ destroyed within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Print this range of block numbers. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 95ae399cae..f4c141f6fb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4021,3 +4021,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.37.1 (Apple Git-137.1)
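As a purely hypothetical illustration of the tool above (the file name and
all values are invented, but the line formats match the printf calls in
pg_walsummary.c):

    $ pg_walsummary 0000000100000028000000000000002900000000.summary
    TS 1663, DB 5, REL 16384, FORK main: limit 0
    TS 1663, DB 5, REL 16384, FORK main: blocks 0..127
    TS 1663, DB 5, REL 16385, FORK main: block 14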
v8-0001-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From f27134fe549da57e0839cc2ab4e5168fa879ad80 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v8 1/6] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
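One detail of the sendFile() change above that may be worth spelling out:
the block number fed into checksum verification is relative to the whole
relation, not to the current segment file. With default build settings
(8kB blocks and 1GB segments, so RELSEG_SIZE = 131072 blocks), block 10 of
segment file "16384.1" is verified as block 1 * 131072 + 10 = 131082, which
is what blkno + segno * RELSEG_SIZE computes now that the caller passes in
the parsed segment number instead of re-deriving it from the file name.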
v8-0006-Test-patch-Enable-summarize_wal-by-default.patch (application/octet-stream)
From aa779568f816259e9f225a8c6f0a2e7cdb0252fa Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 6 Nov 2023 13:53:19 -0500
Subject: [PATCH v8 6/6] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal = on and wal_level = minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7952fd5c4b..a804d07ce5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -935,9 +935,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 505351f663..1bd570b370 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -140,7 +140,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summarize_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index cf39f8a651..e631994da2 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.37.1 (Apple Git-137.1)
v8-0003-Add-a-new-WAL-summarizer-process.patch (application/octet-stream)
From b1631879f17c92eca327a4ae6e9b93901f95e582 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v8 3/6] Add a new WAL summarizer process.
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summarize_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1374 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3722 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bd70ff2e4b..862a143f17 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4134,6 +4134,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summarize-keep-time" xreflabel="wal_summarize_keep_time">
+ <term><varname>wal_summarize_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summarize_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..d69f4ac5a7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3555,6 +3556,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3834,8 +3872,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3878,6 +3916,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5202,9 +5260,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6921,6 +6979,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7595,6 +7672,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ca9d750483
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7b6b613c4a..7952fd5c4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -931,6 +935,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1833,6 +1840,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2667,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3022,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3141,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3550,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3706,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3734,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3832,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4054,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5403,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5543,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..505351f663
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1374 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which the
+ * next summary file will start. Normally, these are the TLI and LSN at
+ * which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
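+ *
+ * For example, on a completely idle system the sleep time doubles through
+ * 200ms, 400ms, 800ms, and so on until it hits the 30-second cap of 150
+ * quanta.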
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summarize_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but seems
+ * reasonable to treat like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if it's
+ * just the beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the information to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
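+ *
+ * For example (numbers illustrative only): if a record ends exactly at
+ * 0/2000000, a page boundary, xlogreader reports 0/2000000 as its end
+ * LSN, but the next record actually begins after the page header, e.g.
+ * at 0/2000028 if that page starts a segment and so carries a long header.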
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
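+ /*
+ * For instance (values illustrative only), a summary of TLI 1 covering
+ * 0/1000028 through 0/1D3AE70 would be renamed into place as
+ * pg_wal/summaries/0000000100000000010000280000000001D3AE70.summary.
+ */
+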
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ CHECK_FOR_INTERRUPTS();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files older than wal_summarize_keep_time, once the
+ * WAL they summarize has itself been removed.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
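+ *
+ * Since wal_summarize_keep_time is measured in minutes, we multiply by
+ * 60 to get seconds; with the default of 14400 minutes (10 days), the
+ * cutoff falls 864000 seconds before the current time.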
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the WAL doesn't exist any more, we can remove it if the file
+ * modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cfc5afaa6f..ef2a3a2bfd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7605eff9b9..cf39f8a651 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3191,6 +3204,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e48c066a5b..01c0428990 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summarize_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..4d32da0507
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
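/*
 * For concreteness, a sketch (not part of the patch) of how one block
 * number maps onto this representation.  Block 200000 falls in chunk 3;
 * while that chunk is sparse, offset 3392 occupies one uint16 array slot,
 * and once the chunk has been converted to a bitmap it is instead bit 0
 * of bitmap word 212.
 */
static void
blkreftable_layout_example(void)
{
    BlockNumber blknum = 200000;
    uint32      chunkno = blknum / BLOCKS_PER_CHUNK;     /* 200000 / 65536 = 3 */
    uint16      chunkoffset = blknum % BLOCKS_PER_CHUNK; /* = 3392 */

    Assert(chunkno == 3 && chunkoffset == 3392);
    Assert(chunkoffset / BLOCKS_PER_ENTRY == 212);  /* word within bitmap */
    Assert(chunkoffset % BLOCKS_PER_ENTRY == 0);    /* bit within word */
}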
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * table reference file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
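/*
 * A minimal usage sketch for the in-memory API (not part of the patch).
 * The RelFileLocator argument is schematic; real callers obtain it from
 * decoded WAL records.
 */
static void
blkreftable_usage_example(RelFileLocator rlocator)
{
    BlockRefTable *brtab = CreateEmptyBlockRefTable();

    /* A WAL record touched block 42 of the main fork. */
    BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 42);

    /*
     * The relation was later truncated to 10 blocks: the limit block drops
     * to 10 and the reference to block 42 is forgotten.
     */
    BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 10);
}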
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ {
+ stop_offset = stop_blkno - (chunkno * BLOCKS_PER_CHUNK);
+ if (stop_offset > BLOCKS_PER_CHUNK)
+ stop_offset = BLOCKS_PER_CHUNK;
+ }
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
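/*
 * Continuing the sketch above (not part of the patch): pulling back the
 * modified blocks recorded for one relation fork.
 */
static void
blkreftable_lookup_example(BlockRefTable *brtab, RelFileLocator rlocator)
{
    BlockRefTableEntry *entry;
    BlockNumber limit_block;
    BlockNumber blocks[16];
    int         nresults;

    entry = BlockRefTableGetEntry(brtab, &rlocator, MAIN_FORKNUM,
                                  &limit_block);
    if (entry == NULL)
        return;                 /* no WAL references to this fork */

    /* Fetch up to 16 modified block numbers below block 1000. */
    nresults = BlockRefTableEntryGetBlocks(entry, 0, 1000, blocks, 16);
    /* blocks[0 .. nresults - 1] are now valid. */
}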
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
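/*
 * The resulting file is therefore: the magic number; then, for each fork
 * in sorted order, a BlockRefTableSerializedEntry followed by its
 * chunk_usage array and chunk data; then an all-zeroes sentinel entry and
 * a CRC.  For illustration only (not part of the patch), a frontend write
 * callback satisfying the io_callback_fn contract might look like this;
 * server-side callers would presumably go through the File machinery
 * rather than stdio:
 */
#ifdef FRONTEND
static int
example_write_cb(void *arg, void *data, int length)
{
    FILE       *fp = (FILE *) arg;

    /* Must write everything, or report an error and not return. */
    if (fwrite(data, 1, length, fp) != (size_t) length)
        pg_fatal("could not write block reference table: %m");
    return length;
}
#endif

/* Serializing a whole in-memory table is then just:
 *     WriteBlockRefTable(brtab, example_write_cb, fp);
 */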
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
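/*
 * Reading a file back follows the two-level protocol described above:
 * advance to a relation fork, then drain its blocks before advancing
 * again.  A sketch (not part of the patch; read_cb and report_error_cb
 * are hypothetical callbacks matching io_callback_fn/report_error_fn):
 */
static void
example_read_all(io_callback_fn read_cb, void *read_arg,
                 report_error_fn report_error_cb, char *filename)
{
    BlockRefTableReader *reader;
    RelFileLocator rlocator;
    ForkNumber  forknum;
    BlockNumber limit_block;

    reader = CreateBlockRefTableReader(read_cb, read_arg, filename,
                                       report_error_cb, NULL);
    while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
                                           &limit_block))
    {
        BlockNumber blocks[256];
        unsigned    nblocks;

        while ((nblocks = BlockRefTableReaderGetBlocks(reader,
                                                       blocks, 256)) > 0)
        {
            /* process blocks[0 .. nblocks - 1] for this relation fork */
        }
    }
    DestroyBlockRefTableReader(reader);
}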
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
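/*
 * A sketch of the incremental write path (not part of the patch), for a
 * caller that produces one entry at a time in the required sort order.
 * The writer itself would be created with CreateBlockRefTableWriter()
 * and finished with DestroyBlockRefTableWriter():
 */
static void
example_write_one_entry(BlockRefTableWriter *writer,
                        RelFileLocator rlocator, ForkNumber forknum)
{
    BlockRefTableEntry *entry = CreateBlockRefTableEntry(rlocator, forknum);

    BlockRefTableEntrySetLimitBlock(entry, 10); /* truncated to 10 blocks */
    BlockRefTableEntryMarkBlockModified(entry, forknum, 4);
    BlockRefTableWriteEntry(writer, entry);
    BlockRefTableFreeEntry(entry);
}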
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f14aed422a..8c550560a7 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12086,4 +12086,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..70d6c072d7
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..15db2377dd
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif /* WALSUMMARIZER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 87c1aee379..bf81b91e20 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3998,3 +3998,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.37.1 (Apple Git-137.1)
v8-0004-Add-support-for-incremental-backup.patch
From 811a0f412169277aceb89b0dd5cc5acd02d3e658 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v8 4/6] Add support for incremental backup.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Should we send the whole backup manifest to the server or, say,
just an LSN?
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 334 ++++-
src/backend/backup/basebackup_incremental.c | 917 ++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1271 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 681 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
47 files changed, 5684 insertions(+), 61 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
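+
+ <para>
+  An incremental backup can only be taken if WAL summarization is enabled.
+  A minimal sketch of the required <filename>postgresql.conf</filename>
+  change, using the <xref linkend="guc-summarize-wal"/> parameter, is:
+<programlisting>
+summarize_wal = on
+</programlisting>
+ </para>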
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and simply take full backups, which are simpler
+ to manage. For a large database all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
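+
+ <para>
+  For example, a chain of incremental backups can be taken by passing each
+  new backup the manifest of the previous one. This is only a sketch, and
+  the directory names are illustrative, not significant:
+<programlisting>
+pg_basebackup -cfast -D sunday
+pg_basebackup -cfast -D monday --incremental sunday/backup_manifest
+pg_basebackup -cfast -D tuesday --incremental monday/backup_manifest
+</programlisting>
+ </para>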
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
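+ <para>
+  As a minimal sketch (the paths here are illustrative), if the chain
+  consists of one full backup and two incremental backups, the
+  reconstruction might look like this:
+<programlisting>
+pg_combinebackup /backups/full /backups/incr1 /backups/incr2 -o /usr/local/pgsql/data
+</programlisting>
+ </para>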
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 862a143f17..1e646f5978 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4137,13 +4137,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
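+
+ <para>
+  As an illustrative sketch (the directory names are assumptions, not
+  defaults), a synthetic full backup can stand in for the chain from which
+  it was built:
+<programlisting>
+pg_combinebackup full incr1 -o synthetic
+pg_combinebackup synthetic incr2 -o restored
+</programlisting>
+  Here <literal>incr2</literal> was taken after <literal>incr1</literal>,
+  and <literal>synthetic</literal> replaces <literal>full</literal> and
+  <literal>incr1</literal> on the command line.
+ </para>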
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
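+ <para>
+  For example (the paths are illustrative):
+<programlisting>
+pg_combinebackup full incr -o outputdir -T /srv/ts_old=/srv/ts_new
+</programlisting>
+ </para>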
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
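+
+ /*
+ * Illustrative sketch of the resulting incremental file layout: the
+ * header emitted above is followed by the requested blocks, which the
+ * loop below sends in the order they appear in incremental_blocks.
+ *
+ * magic unsigned
+ * num_incremental_blocks unsigned
+ * truncation_block_length unsigned
+ * block numbers BlockNumber[num_incremental_blocks]
+ * block contents BLCKSZ bytes per included block
+ *
+ * All header fields are written in native byte order.
+ */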
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..12699b5984
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,917 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
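+ *
+ * As an illustrative example: if this server's history is TLI 1 -> 2 -> 3
+ * and the manifest contains WAL ranges for TLIs 1 and 2, then
+ * readTimeLineHistory returns entries ordered 3, 2, 1; TLI 2 is therefore
+ * the latest WAL range in the manifest and TLI 1 the earliest.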
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
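+ *
+ * For example, segment 3 of relfilenumber 16385 in database 16384, normally
+ * stored as "base/16384/16385.3", is sent incrementally under the name
+ * "base/16384/INCREMENTAL.16385.3". (The OIDs shown are illustrative.)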
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE
+ * entries.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid can be InvalidOid for a shared relation, but spcoid and
+ * relfilenumber should always have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to back
+ * up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so transpose the absolute
+ * block numbers to relative block numbers.
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
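+ *
+ * For illustration, with the default BLCKSZ of 8192, an incremental file
+ * holding 3 blocks occupies 3 * 4 = 12 bytes of header fields, 3 * 4 = 12
+ * bytes of block numbers, and 3 * 8192 = 24576 bytes of block contents,
+ * or 24600 bytes in total.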
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point to
+ * an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from the client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these, as we do elsewhere while in COPY mode. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
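+
+/*
+ * Taken together, UploadManifest() and HandleUploadManifestPacket() are
+ * intended to implement the following exchange, mirroring the client-side
+ * logic this patch adds to pg_basebackup:
+ *
+ *   client: UPLOAD_MANIFEST
+ *   server: CopyInResponse
+ *   client: CopyData messages carrying the raw backup_manifest contents
+ *   client: CopyDone
+ *   server: CommandComplete (once the manifest has been parsed)
+ */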
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..26fd9ad0bc 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
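+ *
+ * For reference, the lines of interest look like this (values are
+ * illustrative):
+ *
+ *   START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
+ *   START TIMELINE: 1
+ *   INCREMENTAL FROM LSN: 0/B000028
+ *   INCREMENTAL FROM TLI: 1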
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
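+ *
+ * For example, applied to the text "0/2000028 " this stores
+ * 0x0000000002000028 into *lsn and leaves *c pointing at the trailing space.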
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
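+ *
+ * For instance, a 25 MB manifest yields an estimate of roughly 260,000
+ * entries; load_backup_manifest() below clamps the estimate to the range
+ * [256, PG_UINT32_MAX] before sizing the table.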
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..e607f35edb
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1271 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
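+
+/*
+ * Under this convention, a file named INCREMENTAL.16385 within an
+ * incremental backup is the incremental form of the file named 16385 at the
+ * same relative path, and reconstruction is expected to write its output
+ * under the unprefixed name. (The relfilenumber shown is illustrative.)
+ */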
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
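+
+ /*
+ * Note that prior_backup_dirs actually points at the full list of backup
+ * directories; we just never look past the first n_prior_backups entries,
+ * which keeps its indexing aligned with the manifests array, whose final
+ * element (manifests[n_prior_backups]) describes the final backup.
+ */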
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ /* read_pg_version_file() already returned PG_VERSION_NUM format */
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
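+ /*
+ * For example (with purely illustrative paths), "-T /srv/old=/srv/new"
+ * relocates the tablespace at /srv/old to /srv/new, while in
+ * "-T /srv/a\=b=/srv/new" the escaped "=" makes the old directory
+ * "/srv/a=b".
+ */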
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
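+
+ /*
+ * As we walk the chain from newest to oldest, check_tli and check_lsn
+ * hold the timeline and LSN at which the current backup claims its
+ * predecessor started; each older backup_label must match them.
+ */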
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", ifulldir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used a multi-part version number (e.g. 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old", filename);
+ pg_fatal("%s: could not parse version number", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..7cd457aef3
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
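+
+ /*
+ * For example, sourcemap[7] == NULL means that block 7 of the output file
+ * will be zero-filled, while sourcemap[7] == s means that block 7 will be
+ * read from offsetmap[7] bytes into s's file.
+ */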
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file but
+ * taking no action on those blocks that generated any WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
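+/*
+ * As consumed below, an incremental file begins with a header consisting of
+ * a 32-bit magic number, a 32-bit block count, a 32-bit truncation block
+ * length, and one 32-bit relative block number per stored block; the stored
+ * blocks themselves, BLCKSZ bytes each, follow that header.
+ */
+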
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf->filename);
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != (ssize_t) length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
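+ /*
+ * Each entry in the plan is printed as "START-END:SOURCE@OFFSET" (or
+ * "BLOCK:SOURCE@OFFSET" for a single-block range), where SOURCE is the
+ * path of the file supplying those blocks; ranges with no source are
+ * shown as "zero" with no offset.
+ */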
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+ 'table has expected contents');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum-Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bf81b91e20..95ae399cae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4009,3 +4009,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v8-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
From 571b192c1655d1e74d0a5d3b0b3915cc69e7463f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v8 2/6] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index bf0227c668..ee6f74e5d5 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
On Tue, Nov 7, 2023 at 2:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.

Here's v8. Changes:
Review comments, based on what I reviewed so far.
- I think 0001 looks like a good improvement irrespective of the patch series.
- review 0003
1.
+ be enabled either on a primary or on a standby. WAL summarization can
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>.
Grammatical error
"WAL summarization can cannot" -> WAL summarization cannot
2.
+ <varlistentry id="guc-wal-summarize-keep-time"
xreflabel="wal_summarize_keep_time">
+ <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>wal_summarize_keep_time</varname>
configuration parameter</primary>
+ </indexterm>
I feel the name of the GUC should be either wal_summarizer_keep_time
or wal_summaries_keep_time; I mean, we should refer either to the
summarizer process or to the WAL summary files.
3.
+XLogGetOldestSegno(TimeLineID tli)
+{
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
Some of the single-line comments end with a full stop whereas others
do not, so better to be consistent.
4.
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end before the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Instead of "If start_lsn != InvalidXLogRecPtr, only summaries that end
before the" it should be "If start_lsn != InvalidXLogRecPtr, only
summaries that end after the" because only if the summary files are
Ending after the start_lsn then it will have some overlapping and we
need to return them if ending before start lsn then those files are
not overlapping at all, right?
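
To make the semantics concrete, this is the overlap test I have in
mind -- just a sketch, assuming each summary covers a half-open LSN
range [s_start, s_end), and with an invented function name:

/*
 * Sketch: does a summary covering [s_start, s_end) overlap the requested
 * range?  An invalid start_lsn or end_lsn leaves that side unbounded.
 */
static bool
summary_overlaps_range(XLogRecPtr s_start, XLogRecPtr s_end,
                       XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
    /* A summary that ends at or before start_lsn does not overlap. */
    if (!XLogRecPtrIsInvalid(start_lsn) && s_end <= start_lsn)
        return false;
    /* A summary that starts at or after end_lsn does not overlap either. */
    if (!XLogRecPtrIsInvalid(end_lsn) && s_start >= end_lsn)
        return false;
    return true;
}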
5.
In FilterWalSummaries() header also the comment is wrong same as for
GetWalSummaries() function.
6.
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
I did not see a caller of this function yet, but I think if the whole
range is not covered, why not keep the behavior of '*missing_lsn'
uniform? I mean, suppose there is no file at all; then missing_lsn
could be start_lsn, because the very first LSN of the range is missing.
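
In other words, something of this shape -- a sketch of the variant I
am suggesting, not the patch's actual code:

/* Sketch: always point missing_lsn at the first uncovered LSN. */
if (wslist == NIL)
{
    *missing_lsn = start_lsn;   /* nothing covers the range at all */
    return false;
}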
7.
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
/could not write file/ could not read file
8.
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Great stuff you got here. I'm doing a first pass trying to grok the
whole thing for more substantive comments, but in the meantime here are
some cosmetic ones.
I got the following warnings, both valid:
../../../../pgsql/source/master/src/common/blkreftable.c: In function 'WriteBlockRefTable':
../../../../pgsql/source/master/src/common/blkreftable.c:520:45: warning: declaration of 'brtentry' shadows a previous local [-Wshadow=compatible-local]
520 | BlockRefTableEntry *brtentry;
| ^~~~~~~~
../../../../pgsql/source/master/src/common/blkreftable.c:492:37: note: shadowed declaration is here
492 | BlockRefTableEntry *brtentry;
| ^~~~~~~~
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c: In function 'SummarizeWAL':
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:833:57: warning: declaration of 'private_data' shadows a previous local [-Wshadow=compatible-local]
833 | SummarizerReadLocalXLogPrivate *private_data;
| ^~~~~~~~~~~~
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:709:41: note: shadowed declaration is here
709 | SummarizerReadLocalXLogPrivate *private_data;
| ^~~~~~~~~~~~
In blkreftable.c, I think the definition of SH_EQUAL should have an
outer layer of parentheses. Also, it would be good to provide and use a
function to initialize a BlockRefTableKey from the RelFileNode and
forknum components, and ensure that any padding bytes are zeroed.
Otherwise it's not going to be a great hash key. On my platform there
aren't any (padding bytes), but I think it's unwise to rely on that.
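
Something like this is what I have in mind -- only a sketch, assuming
the key holds a RelFileLocator plus a fork number; the helper name is
invented:

/*
 * Zero the whole key first so that any padding bytes have a
 * deterministic value, then fill in the fields.
 */
static inline void
InitBlockRefTableKey(BlockRefTableKey *key,
                     const RelFileLocator *rlocator,
                     ForkNumber forknum)
{
    memset(key, 0, sizeof(BlockRefTableKey));
    key->rlocator = *rlocator;
    key->forknum = forknum;
}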
I don't think SummarizerReadLocalXLogPrivate->waited is used for
anything. Could be removed AFAICS, unless you foresee adding
something that uses it.
These forward struct declarations are not buying you anything, I'd
remove them:
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
index 70d6c072d7..316e67122c 100644
--- a/src/include/common/blkreftable.h
+++ b/src/include/common/blkreftable.h
@@ -29,10 +29,7 @@
/* Magic number for serialization file format. */
#define BLOCKREFTABLE_MAGIC 0x652b137b
-struct BlockRefTable;
-struct BlockRefTableEntry;
-struct BlockRefTableReader;
-struct BlockRefTableWriter;
+/* Struct definitions appear in blkreftable.c */
typedef struct BlockRefTable BlockRefTable;
typedef struct BlockRefTableEntry BlockRefTableEntry;
typedef struct BlockRefTableReader BlockRefTableReader;
and backup_label.h doesn't know about TimeLineID, so it needs this:
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
index 08d6ed67a9..3af7ea274c 100644
--- a/src/bin/pg_combinebackup/backup_label.h
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -12,6 +12,7 @@
#ifndef BACKUP_LABEL_H
#define BACKUP_LABEL_H
+#include "access/xlogdefs.h"
#include "common/checksum_helper.h"
#include "lib/stringinfo.h"
I don't much like the way header files in src/bin/pg_combinebackup files
are structured. Particularly, causing a simplehash to be "instantiated"
just because load_manifest.h is included seems poised to cause pain. I
think there should be a file with the basic struct declarations (no
simplehash); and then maybe since both pg_basebackup and
pg_combinebackup seem to need the same simplehash, create a separate
header file containing just that.. But, did you notice that anything
that includes reconstruct.h will instantiate the simplehash stuff,
because it includes load_manifest.h? It may be unwise to have the
simplehash in a header file. Maybe just declare it in each .c file that
needs it. The duplicity is not that large.
I'll see if I can understand the way all these headers are needed to
propose some other arrangement.
I see this idea of having "struct FooBar;" immediately followed by
"typedef struct FooBar FooBar;" which I mentioned from blkreftable.h
occurs in other places as well (JsonManifestParseContext in
parse_manifest.h, maybe others?). Was this pattern cargo-culted from
somewhere? Perhaps we have other places to clean up.
Why leave unnamed arguments in function declarations? For example, in
static void manifest_process_file(JsonManifestParseContext *,
char *pathname,
size_t size,
pg_checksum_type checksum_type,
int checksum_length,
uint8 *checksum_payload);
the first argument lacks a name. Is this just an oversight, I hope?
In GetFileBackupMethod(), which arguments are in and which are out?
The comment doesn't say, and it's not obvious why we pass both the file
path as well as the individual constituent pieces for it.
DO_NOT_BACKUP_FILE appears not to be set anywhere. Do you expect to use
this later? If not, maybe remove it.
There are two functions named record_manifest_details_for_file() in
different programs. I think this sort of arrangement is not great, as
it is confusing to follow. It would be better if those two
routines were called something like, say, verifybackup_perfile_cb and
combinebackup_perfile_cb instead; then in the function comment say
something like
/*
* JsonManifestParseContext->perfile_cb implementation for pg_combinebackup.
*
* Record details extracted from the backup manifest for one file,
* because we like to keep things tracked or whatever.
*/
so it's easy to track down what does what and why. Same with
perwalrange_cb. "perfile" looks bothersome to me as a name entity. Why
not per_file_cb? and per_walrange_cb?
In walsummarizer.c, HandleWalSummarizerInterrupts is called in
summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
Maybe it should?
I think this path is not going to be very human-likeable.
snprintf(final_path, MAXPGPATH,
XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
tli,
LSN_FORMAT_ARGS(summary_start_lsn),
LSN_FORMAT_ARGS(summary_end_lsn));
Why not add a dash between the TLI and between both LSNs, or something
like that? (Also, are we really printing TLIs as 8-character hex values?)
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"I suspect most samba developers are already technically insane...
Of course, since many of them are Australians, you can't tell." (L. Torvalds)
On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
- I think 0001 looks like a good improvement irrespective of the patch series.
OK, perhaps that can be independently committed, then, if nobody objects.
Thanks for the review; I've fixed a bunch of things that you
mentioned. I'll just comment on the ones I haven't yet done anything
about below.
2. + <varlistentry id="guc-wal-summarize-keep-time" xreflabel="wal_summarize_keep_time"> + <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>) + <indexterm> + <primary><varname>wal_summarize_keep_time</varname> configuration parameter</primary> + </indexterm>I feel the name of the guy should be either wal_summarizer_keep_time
or wal_summaries_keep_time, I mean either we should refer to the
summarizer process or to the way summaries files.
How about wal_summary_keep_time?
6.
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,

I did not see a caller of this function yet, but I think if the whole
range is not covered, why not keep the behavior of '*missing_lsn'
uniform? I mean, suppose there is no file at all; then missing_lsn
could be start_lsn, because the very first LSN of the range is missing.
It's used later in the patch series. I think the way that I have it
makes for a more understandable error message.
8.
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
I'm not sure what needs fixing here.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Great stuff you got here. I'm doing a first pass trying to grok the
whole thing for more substantive comments, but in the meantime here are
some cosmetic ones.
Thanks, thanks, and thanks.
I've fixed some things that you mentioned in the attached version.
Other comments below.
In blkreftable.c, I think the definition of SH_EQUAL should have an
outer layer of parentheses. Also, it would be good to provide and use a
function to initialize a BlockRefTableKey from the RelFileNode and
forknum components, and ensure that any padding bytes are zeroed.
Otherwise it's not going to be a great hash key. On my platform there
aren't any (padding bytes), but I think it's unwise to rely on that.
I'm having trouble understanding the second part of this suggestion.
Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and
in a backend context, we get the default, which is
MemoryContextAllocZero. Maybe there's some case this doesn't cover,
though?
These forward struct declarations are not buying you anything, I'd
remove them:
I've had problems from time to time when I don't do this. I'll remove
it here, but I'm not convinced that it's always useless.
I don't much like the way header files in src/bin/pg_combinebackup files
are structured. Particularly, causing a simplehash to be "instantiated"
just because load_manifest.h is included seems poised to cause pain. I
think there should be a file with the basic struct declarations (no
simplehash); and then maybe since both pg_basebackup and
pg_combinebackup seem to need the same simplehash, create a separate
header file containing just that.. But, did you notice that anything
that includes reconstruct.h will instantiate the simplehash stuff,
because it includes load_manifest.h? It may be unwise to have the
simplehash in a header file. Maybe just declare it in each .c file that
needs it. The duplicity is not that large.
I think that I did this correctly. AIUI, if you're defining a
simplehash that only one source file needs, you make the scope
"static" and do both SH_DECLARE and SH_DEFILE it in that file. If you
need it to be shared by multiple files, you make it "extern" in the
header file, do SH_DECLARE there, and SH_DEFINE in one of those source
files. Or you could make the scope "static inline" in the header file
and then you'd do both SH_DECLARE and SH_DEFINE in the header file.
If I were to do as you suggest here, I think I'd end up with 2 copies
of the compiled code for this instead of one, and if they ever got out
of sync everything would break silently.
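
For the record, the single-file pattern I mean looks roughly like this
-- a sketch with made-up names, not the actual definitions from the
patch:

/* In exactly one .c file: declare and define a private simplehash. */
typedef struct file_hash_entry
{
    const char *path;           /* hash key */
    char        status;         /* per-entry state required by simplehash */
} file_hash_entry;

#define SH_PREFIX               file_hash
#define SH_ELEMENT_TYPE         file_hash_entry
#define SH_KEY_TYPE             const char *
#define SH_KEY                  path
#define SH_HASH_KEY(tb, key)    hash_path(key)  /* made-up hash helper */
#define SH_EQUAL(tb, a, b)      (strcmp(a, b) == 0)
#define SH_SCOPE                static
#define SH_RAW_ALLOCATOR        pg_malloc0
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"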
Why leave unnamed arguments in function declarations? For example, in
static void manifest_process_file(JsonManifestParseContext *,
char *pathname,
size_t size,
pg_checksum_type checksum_type,
int checksum_length,
uint8 *checksum_payload);
the first argument lacks a name. Is this just an oversight, I hope?
I mean, I've changed it now, but I don't think it's worth getting too
excited about. "int checksum_length" is much better documentation than
just "int," but "JsonManifestParseContext *context" is just noise,
IMHO. You can argue that it's better for consistency that way, but
whatever.
In GetFileBackupMethod(), which arguments are in and which are out?
The comment doesn't say, and it's not obvious why we pass both the file
path as well as the individual constituent pieces for it.
The header comment does document which values are potentially set on
return. I guess I thought it was clear enough that the stuff not
documented to be output parameters was input parameters. Most of them
aren't even pointers, so they have to be input parameters. The only
exception is 'path', which I have some difficulty thinking that anyone
is going to imagine to be an output pointer.
Maybe you could propose a more specific rewording of this comment?
FWIW, I'm not altogether sure whether this function is going to get
more heavily adjusted in a rev or three of the patch set, so maybe we
want to wait to sort this out until this is closer to final, but OTOH
if I know what you have in mind for the current version, I might be
more likely to keep it in a good place if I end up changing it.
DO_NOT_BACKUP_FILE appears not to be set anywhere. Do you expect to use
this later? If not, maybe remove it.
Woops, that was a holdover from an earlier version.
There are two functions named record_manifest_details_for_file() in
different programs. I think this sort of arrangement is not great, as
it is confusing confusing to follow. It would be better if those two
routines were called something like, say, verifybackup_perfile_cb and
combinebackup_perfile_cb instead; then in the function comment say
something like
/*
* JsonManifestParseContext->perfile_cb implementation for pg_combinebackup.
*
* Record details extracted from the backup manifest for one file,
* because we like to keep things tracked or whatever.
*/
so it's easy to track down what does what and why. Same with
perwalrange_cb. "perfile" looks bothersome to me as a name entity. Why
not per_file_cb? and per_walrange_cb?
I had trouble figuring out how to name this stuff. I did notice the
awkwardness, but surely nobody can think that two functions with the
same name in different binaries can be actually the same function.
If we want to inject more underscores here, my vote is to go all the
way and make it per_wal_range_cb.
In walsummarizer.c, HandleWalSummarizerInterrupts is called in
summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
Maybe it should?
I replaced all the CHECK_FOR_INTERRUPTS() in that file with
HandleWalSummarizerInterrupts(). Does that seem right?
I think this path is not going to be very human-likeable.
snprintf(final_path, MAXPGPATH,
XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
tli,
LSN_FORMAT_ARGS(summary_start_lsn),
LSN_FORMAT_ARGS(summary_end_lsn));
Why not add a dash between the TLI and between both LSNs, or something
like that? (Also, are we really printing TLIs as 8-byte hexs?)
Dealing with the last part first, we already do that in every WAL file
name. I actually think these file names are easier to work with than
WAL file names, because 000000010000000000000020 is not the WAL
starting at 0/20, but rather the WAL starting at 0/20000000. To know
at what LSN a WAL file starts, you have to mentally delete characters
17 through 22, which will always be zero, and instead add six zeroes
at the end. I don't think whoever came up with that file naming
convention deserves an award, unless it's a raspberry award. With
these names, you get something like
0000000100000000015125B800000000015128F0.summary and you can sort of
see that 1512 repeats so the LSN went from something ending in 5B8 to
something ending in 8F0. I actually think it's way better.
But I have a hard time arguing that it wouldn't be more readable still
if we put some separator characters in there. I didn't do that because
then they'd look less like WAL file names, but maybe that's not really
a problem. A possible reason not to bother is that these files are
less important for humans to deal with than WAL files, since they
don't need to be archived or transported between nodes in any way.
Basically I think this is probably fine the way it is, but if you or
others think it's really important to change it, I can do that. Just
as long as we don't spend 50 emails arguing about which separator
character to use.
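For the archives, here's roughly what reading one of these files with
the pg_walsummary tool from 0005 might look like. The values below are
invented, but the output format follows the printf calls in
dump_one_relation:

pg_walsummary pg_wal/summaries/0000000100000000015125B800000000015128F0.summary
TS 1663, DB 5, REL 16384, FORK main: limit 0
TS 1663, DB 5, REL 16384, FORK main: blocks 0..127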
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v9-0005-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From f3905c11344d141ce84e900b19bf7bdd3c2d1572 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v9 5/6] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 ++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 475 insertions(+)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..3a2122b067
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found in the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, database OID, relation
+ OID, and relation fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ deleted within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("no input files specified");
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ea71b215ee..42728036e2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4025,3 +4025,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.37.1 (Apple Git-137.1)
v9-0001-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From 70f747876d92ffc5c930f2212b7abaad44cde3ca Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v9 1/6] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 480d67e02c..35dd79babc 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1166,7 +1151,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1193,17 +1179,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1251,37 +1243,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1420,8 +1415,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1443,40 +1438,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1491,6 +1452,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1498,8 +1460,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1525,36 +1485,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1569,7 +1507,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
v9-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch (application/octet-stream)
From 9c839024a4b4e95c583874151a42ddf7cb50986e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v9 2/6] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index bf0227c668..ee6f74e5d5 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v9-0004-Add-support-for-incremental-backup.patch (application/octet-stream)
From 32077e8302fb706369ae51f9a83f198278558b59 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v9 4/6] Add support for incremental backup.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Should we send the whole backup manifest to the server or, say,
just an LSN?
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 313 +++-
src/backend/backup/basebackup_incremental.c | 914 ++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1270 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 682 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
47 files changed, 5669 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and simply take full backups, which are simpler
+ to manage. For a large database that is heavily modified throughout,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 15471a6b38..cd46d8ff4e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4137,13 +4137,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..acc1a6bada 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1129,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1163,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1183,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1192,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1234,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1386,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1461,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
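+
+ /*
+ * For example (illustrative OIDs), a relation segment read from
+ * base/16384/16385 is archived as base/16384/INCREMENTAL.16385 when
+ * sent incrementally.
+ */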
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1536,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
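+
+ /*
+ * Sketch of the resulting file layout, assuming blocks 3 and 7 are
+ * included: the magic number, a block count of 2, the truncation
+ * block length, and block numbers 3 and 7 (four bytes each), followed
+ * by the two BLCKSZ-byte block images sent by the loop below.
+ */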
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
+
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..db079dfc67
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,914 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
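+ *
+ * For example (illustrative OIDs), segment 3 of relation base/16384/16385
+ * yields base/16384/INCREMENTAL.16385.3, while segment 0 yields
+ * base/16384/INCREMENTAL.16385.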
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid for a shared relation, but spcoid and
+ * relfilenumber should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
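+ /*
+ * For instance, with the default BLCKSZ of 8192, a 1000-block segment
+ * with 901 or more modified blocks is sent in full, since
+ * 901 * 8192 > (1000 * 8192) * 0.9.
+ */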
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so transpose absolute block
+ * numbers to relative block numbers.
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length), followed by the block numbers and then the block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
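+
+ /*
+ * For example, with the default BLCKSZ of 8192, an incremental file
+ * covering 10 blocks is 3 * 4 + 10 * (4 + 8192) = 81,972 bytes.
+ */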
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
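+ *
+ * Protocol sketch, as implemented below: the server replies with a
+ * CopyInResponse ('G'), the client streams the manifest as CopyData ('d')
+ * messages and terminates with CopyDone ('c'), and the parsed result is
+ * retained for use by a later BASE_BACKUP with the INCREMENTAL option.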
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these, as we do elsewhere while receiving COPY data. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..26fd9ad0bc 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, the file will also be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ if (pg_checksum_init(&checksum_ctx, checksum_type) < 0)
+ pg_fatal("could not initialize checksum of file \"%s\"", "backup_label");
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and if sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..f2b45787e9
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
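+ /*
+ * At present the only special primitive we know how to use is Windows'
+ * CopyFile(); elsewhere, we always fall through to the block-by-block
+ * copy below.
+ */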
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ; /* arbitrary, but block-aligned */
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
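+ /* Clamp to a minimum of 256 entries and a maximum of PG_UINT32_MAX. */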
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..7bf56e57ae
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1270 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version * 10000, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /*
+ * Try to read each backup_label file in turn, last to first. Each label
+ * records where the backup it is based on started, so walking the chain
+ * backwards lets us verify that every backup descends from the one
+ * listed just before it.
+ */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
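+
+ /*
+ * Remember where this backup says its predecessor started; the next
+ * (older) backup will be checked against these values.
+ */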
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2: /* exists, contains only dot-prefixed files */
+ case 3: /* exists, contains a mount point */
+ case 4: /* exists, not empty */
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
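+
+ /*
+ * For example, when processing the subdirectory "base/1" of the main
+ * backup directory, manifest_prefix is "base/1/", so a file named "2608"
+ * in that directory is looked up in the manifest as "base/1/2608".
+ */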
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are dealt with elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format (e.g., if PG_VERSION contains "14\n", this function
+ * returns 140000).
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g., 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old", filename);
+ pg_fatal("%s: could not parse version number", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ closedir(dir);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..e7f0523fe9
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,682 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
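+
+/*
+ * As read back by make_incremental_rfile() below, an incremental file
+ * consists of a magic number, the count of blocks stored in the file, the
+ * truncation block length, an array of relative block numbers, and then
+ * the data for the stored blocks themselves.
+ */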
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
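+ *
+ * For example, given a chain consisting of a full backup and two
+ * incrementals, with the file being reconstructed coming from the newest
+ * incremental, n_prior_backups would be 2 and prior_backup_dirs would
+ * contain the full backup's path followed by the older incremental's path.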
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
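+
+ /*
+ * For instance, sourcemap[4] == s with offsetmap[4] == 3 * BLCKSZ means
+ * that block 4 of the output file comes from the file described by s,
+ * starting at byte offset 3 * BLCKSZ. A NULL entry in sourcemap denotes
+ * a block that will be zero-filled.
+ */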
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without
+ * taking any WAL-logged action on those blocks.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block, s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block, s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source, except if dry-run. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+	'found expected rows on node3');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum-Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
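To make the intended call sequence concrete, here is a minimal usage
sketch for the API above (not part of the patch; the file name, size,
mtime, and helper name are made up, and error handling is omitted):

#include "postgres_fe.h"

#include <time.h>

#include "common/checksum_helper.h"
#include "write_manifest.h"

/* Sketch: emit a manifest containing a single file and no WAL ranges. */
static void
write_trivial_manifest(char *outdir)
{
    manifest_writer *mwriter = create_manifest_writer(outdir);

    /* One entry with no checksum: checksum length 0 and a NULL payload. */
    add_file_to_manifest(mwriter, "PG_VERSION", 3, (pg_time_t) time(NULL),
                         CHECKSUM_TYPE_NONE, 0, NULL);

    /* A NULL WAL-range list produces an empty "WAL-Ranges" array. */
    finalize_manifest(mwriter, NULL);
}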
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..5eafe62fc6
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
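As a usage sketch for the API above (not part of the patch; the helper
name is mine, and sizing the block-number buffer at one entry per block
in the file is an assumption):

#include "postgres.h"

#include "backup/basebackup_incremental.h"

/* Sketch: how many bytes would we send for this relation file? */
static size_t
bytes_to_send(IncrementalBackupInfo *ib, char *path, Oid dboid, Oid spcoid,
              RelFileNumber relfilenumber, ForkNumber forknum,
              unsigned segno, size_t size)
{
    BlockNumber *relative_block_numbers;
    unsigned    num_blocks_required;
    unsigned    truncation_block_length;
    size_t      result = size;

    /* Assumed upper bound: one entry per block in the file. */
    relative_block_numbers = palloc(sizeof(BlockNumber) * (size / BLCKSZ));

    if (GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
                            forknum, segno, size,
                            &num_blocks_required,
                            relative_block_numbers,
                            &truncation_block_length) ==
        BACK_UP_FILE_INCREMENTALLY)
        result = GetIncrementalFileSize(num_blocks_required);

    pfree(relative_block_numbers);
    return result;
}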
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..72b4ecaf12 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 659e58aeac..ea71b215ee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4013,3 +4013,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v9-0003-Add-a-new-WAL-summarizer-process.patch
From 189c2290b863776fae7251709ab27f893989d4bc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v9 3/6] Add a new WAL summarizer process.
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summarize_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
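To restate the limit-block rules above in code form, here is a sketch
of how a consumer would classify one relation fork's entry (for
illustration only; this function does not appear in the patch):

#include "postgres.h"

#include "storage/block.h"

/* Sketch: interpret a WAL summary's limit block for one relation fork. */
static const char *
classify_limit_block(BlockNumber limit_block)
{
    if (limit_block == 0)
        return "created or destroyed within the summarized WAL range";
    if (limit_block != InvalidBlockNumber)
        return "truncated to this length within the summarized WAL range";
    return "neither created, destroyed, nor truncated";
}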
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1361 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3704 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc35a46e5e..15471a6b38 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4134,6 +4134,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summarize-keep-time" xreflabel="wal_summarize_keep_time">
+ <term><varname>wal_summarize_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summarize_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1159dff1a6..678495a64b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3574,6 +3575,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3853,8 +3891,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3897,6 +3935,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5221,9 +5279,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6940,6 +6998,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7614,6 +7691,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
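Putting the pieces of walsummary.c together, a consumer such as the
incremental backup code might verify coverage of a WAL range roughly as
in the sketch below (the function and error wording are mine, not the
patch's):

#include "postgres.h"

#include "access/xlogdefs.h"
#include "backup/walsummary.h"

/* Sketch: fail unless summaries fully cover start_lsn..end_lsn on tli. */
static void
check_summary_coverage(TimeLineID tli, XLogRecPtr start_lsn,
                       XLogRecPtr end_lsn)
{
    List       *wslist;
    XLogRecPtr  missing_lsn;

    /* Fetch only summaries on this timeline that overlap the range. */
    wslist = GetWalSummaries(tli, start_lsn, end_lsn);

    if (!WalSummariesAreComplete(wslist, start_lsn, end_lsn, &missing_lsn))
        ereport(ERROR,
                errmsg("WAL summaries are incomplete starting at %X/%X",
                       LSN_FORMAT_ARGS(missing_lsn)));
}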
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7b6b613c4a..7952fd5c4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -931,6 +935,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1833,6 +1840,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2667,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3022,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3141,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3550,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3706,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3734,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3832,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4054,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5403,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5543,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..fe09207ddc
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1361 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and
+ * LSN at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
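+
+/*
+ * Illustrative sketch only, not part of the patch (compare the real
+ * GetOldestUnsummarizedLSN()): reading the progress fields described
+ * above requires holding WALSummarizerLock.
+ */
+static XLogRecPtr
+peek_summarized_position(TimeLineID *tli, bool *exact)
+{
+ XLogRecPtr result = InvalidXLogRecPtr;
+
+ LWLockAcquire(WALSummarizerLock, LW_SHARED);
+ if (WalSummarizerCtl->initialized)
+ {
+ result = WalSummarizerCtl->summarized_lsn;
+ *tli = WalSummarizerCtl->summarized_tli;
+ *exact = WalSummarizerCtl->lsn_is_exact;
+ }
+ LWLockRelease(WALSummarizerLock);
+ return result;
+}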
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
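+
+/*
+ * Illustrative sketch only, not part of the patch: the wait derived from
+ * sleep_quanta is always sleep_quanta * MS_PER_SLEEP_QUANTUM milliseconds,
+ * i.e. between 200ms and the 30s bound discussed above. See
+ * summarizer_wait_for_wal() for the policy that adjusts sleep_quanta.
+ */
+static inline long
+summarizer_sleep_ms(void)
+{
+ Assert(sleep_quanta >= 1 && sleep_quanta <= MAX_SLEEP_QUANTA);
+ return sleep_quanta * MS_PER_SLEEP_QUANTUM;
+}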
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summarize_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
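+ *
+ * Hypothetical usage sketch (illustrative only; 'target_lsn' stands for
+ * whatever LSN the caller needs summarized):
+ *
+ *     XLogRecPtr reached = WaitForWalSummarization(target_lsn, 60000);
+ *
+ *     if (reached < target_lsn)
+ *         ereport(ERROR,
+ *                 errmsg("timed out waiting for WAL summarization"));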
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
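+
+ /*
+ * For illustration (hypothetical LSNs): if one record ends exactly at
+ * the page boundary 0/02000000 and another begins on that page, the
+ * xlogreader reports the first record's end LSN as 0/02000000, even
+ * though the next record really begins only after the page header --
+ * e.g. at 0/02000018 if that header happens to be 24 bytes. Both forms
+ * of start LSN must therefore be acceptable here.
+ */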
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
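+
+ /*
+ * For example (hypothetical values): TLI 1 with start LSN 0/01000028
+ * and end LSN 0/0100F528 yields the file name
+ * "000000010000000001000028000000000100F528.summary".
+ */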
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file; the xlogreader was already freed above. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
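+ *
+ * Illustrative dynamics, assuming the sleep starts at one quantum: with no
+ * new WAL arriving, successive sleeps last 1, 2, 4, 8, ... quanta until
+ * MAX_SLEEP_QUANTA is reached; when N > 1 pages are read between sleeps,
+ * the next sleep shrinks by N - 1 quanta, down to a minimum of one quantum.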
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files whose summarized WAL is gone, once they are
+ * older than the configured retention time.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
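+ *
+ * For example, with the default wal_summarize_keep_time of 10 days
+ * (14400 minutes), the cutoff falls 864,000 seconds before now.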
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the WAL summarized by this file no longer exists on disk, we can
+ * remove the summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cfc5afaa6f..ef2a3a2bfd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b764ef6998..c9dde2d5e5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3191,6 +3204,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e48c066a5b..01c0428990 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summarize_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..ac6860a9ae
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
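+ *
+ * For example, if a fork is truncated to 100 blocks, its limit block
+ * becomes 100 and any recorded modifications to blocks 100 and above are
+ * forgotten. If block 120 is later marked modified, the fork must have
+ * been re-extended past that point in the meantime.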
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
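+
+/*
+ * Worked example: block number 200000 falls in chunk 3 (200000 / 65536),
+ * at offset 3392 (200000 % 65536) within that chunk. In the array
+ * representation we would store the uint16 value 3392; in the bitmap
+ * representation we would set bit 3392 of the chunk's bitmap.
+ */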
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
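+
+/*
+ * The resulting file layout, as produced by WriteBlockRefTable() below, is:
+ * a uint32 magic number; then, for each relation fork in sorted order, one
+ * BlockRefTableSerializedEntry followed by its chunk usage array and the
+ * used portion of each chunk; then an all-zeroes sentinel entry; and
+ * finally a CRC-32C of everything preceding it.
+ */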
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * table reference file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1 && (stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
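+
+/*
+ * Hypothetical usage sketch (illustrative only): to scan all the modified
+ * blocks of an entry in batches of 256, a caller might loop like this,
+ * relying on block numbers being returned in ascending order:
+ *
+ *     BlockNumber blocks[256];
+ *     BlockNumber start = 0;
+ *     int n;
+ *
+ *     while ((n = BlockRefTableEntryGetBlocks(entry, start,
+ *                                             InvalidBlockNumber,
+ *                                             blocks, 256)) > 0)
+ *     {
+ *         ... process blocks[0] through blocks[n - 1] ...
+ *         start = blocks[n - 1] + 1;
+ *     }
+ */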
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index bd0b8873d3..ce74612703 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12087,4 +12087,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..15db2377dd
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 92c0003ab1..659e58aeac 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4002,3 +4002,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.37.1 (Apple Git-137.1)
Attachment: v9-0006-Test-patch-Enable-summarize_wal-by-default.patch (application/octet-stream)
From 7270b8d13f432919cbee984b031b31db4d9ea48e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 6 Nov 2023 13:53:19 -0500
Subject: [PATCH v9 6/6] Test patch: Enable summarize_wal by default.
To avoid test failures, must remove the prohibition against running
with summarize_wal=on and wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7952fd5c4b..a804d07ce5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -935,9 +935,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index fe09207ddc..505ccbf1b8 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summarize_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c9dde2d5e5..20f415c6d2 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.37.1 (Apple Git-137.1)
On Tue, Nov 14, 2023 at 12:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
- I think 0001 looks like a good improvement irrespective of the patch series.
OK, perhaps that can be independently committed, then, if nobody objects.
Thanks for the review; I've fixed a bunch of things that you
mentioned. I'll just comment on the ones I haven't yet done anything
about below.

2. + <varlistentry id="guc-wal-summarize-keep-time" xreflabel="wal_summarize_keep_time">
+ <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>wal_summarize_keep_time</varname> configuration parameter</primary>
+ </indexterm>
I feel the name of the GUC should be either wal_summarizer_keep_time
or wal_summaries_keep_time, I mean either we should refer to the
summarizer process or to the WAL summaries files.
How about wal_summary_keep_time?
Yes, that looks perfect to me.
6. + * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
I did not see the usage of this function, but I think if the whole
range is not covered, why not keep the behavior uniform w.r.t. what we
set for '*missing_lsn'? I mean, suppose there is no file; then
missing_lsn is the start_lsn, because the very first LSN is missing.
It's used later in the patch series. I think the way that I have it
makes for a more understandable error message.
Okay
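For what it's worth, a rough sketch of how a caller can turn the two
cases into distinct reports (hypothetical wording, with tli, start_lsn,
and end_lsn assumed to be in scope; the real messages appear later in
the series):

XLogRecPtr missing_lsn;

if (!WalSummariesAreComplete(wslist, start_lsn, end_lsn, &missing_lsn))
{
    if (XLogRecPtrIsInvalid(missing_lsn))
        ereport(ERROR,
                (errmsg("no WAL summaries exist on timeline %u between %X/%X and %X/%X",
                        tli,
                        LSN_FORMAT_ARGS(start_lsn),
                        LSN_FORMAT_ARGS(end_lsn))));
    else
        ereport(ERROR,
                (errmsg("WAL summaries on timeline %u between %X/%X and %X/%X are incomplete",
                        tli,
                        LSN_FORMAT_ARGS(start_lsn),
                        LSN_FORMAT_ARGS(end_lsn)),
                 errdetail("The first unsummarized LSN in this range is %X/%X.",
                           LSN_FORMAT_ARGS(missing_lsn))));
}

The InvalidXLogRecPtr convention lets the message distinguish "no
summaries at all" from "a gap starting at a specific LSN".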
8. +/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
I'm not sure what needs fixing here.
I think I copy-pasted it by mistake, just ignore it.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 14, 2023 at 2:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Great stuff you got here. I'm doing a first pass trying to grok the
whole thing for more substantive comments, but in the meantime here are
some cosmetic ones.Thanks, thanks, and thanks.
I've fixed some things that you mentioned in the attached version.
Other comments below.
Here are some more comments based on what I have read so far, mostly
cosmetic comments.
1.
+ * summary file yet, then stoppng doesn't make any sense, and we
+ * should wait until the next stop point instead.
Typo /stoppng/stopping
2.
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+
We have already freed the xlogreader so the second part of the comment
is not valid.
3. + /*
+ * If a relation fork is truncated on disk, there is in point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
Typo. /there is in point/ there is no point
4.
+/*
+ * Special handling for WAL recods with RM_XACT_ID.
+ */
/recods/records
5.
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
For SmgrCreate and Truncate I understand setting the 'limit block' but
why for commit/abort? I think it would be better to add some comments
here.
6.
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
The comments say "private_data->read_upto to the lowest LSN that is
not known to be safe" but is it really the lowest LSN? I think it is
the highest LSN that is known to be safe for that TLI, no?
7.
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
I was just wondering whether removing old summaries should be the job
of the WAL summarizer or the job of the checkpointer; I mean, while
removing the old WAL files it could also check and remove the old
summaries? Anyway, it's just a question and I do not have a strong
opinion on this.
8.
+ /*
+ * Whether we we removed the file or not, we need not consider it
+ * again.
+ */
Typo /Whether we we removed/ Whether we removed
9.
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
If this function is already returning 'BlockRefTableEntry' then why
would it need to set an extra '*limit_block' out parameter which it is
actually reading from the entry itself?
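For context, a call site would look roughly like this (a sketch based
only on the declarations in blkreftable.h; the variable names are made
up):

BlockNumber limit_block;
BlockNumber blocks[256];
BlockRefTableEntry *entry;
int nresults = 0;

/* Look up the fork; limit_block is only set when an entry exists. */
entry = BlockRefTableGetEntry(brtab, &rlocator, MAIN_FORKNUM, &limit_block);
if (entry != NULL)
    nresults = BlockRefTableEntryGetBlocks(entry, 0, stop_blkno,
                                           blocks, lengthof(blocks));

One possible reason for the out parameter is that BlockRefTableEntry is
opaque outside blkreftable.c, so callers would otherwise need an
accessor function just to fetch the limit block.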
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
0001 looks OK to push, and since it stands on its own I would get it out
of the way soon rather than waiting for the rest of the series to be
further reviewed.
0002:
This moves bin/pg_verifybackup/parse_manifest.c to
common/parse_manifest.c, where it's not clear that it's for backup
manifests (wasn't a problem in the previous location). I wonder if
we're going to have anything else called "manifest", in which case I
propose to rename the file to make it clear that this is about backup
manifests -- maybe parse_bkp_manifest.c.
This patch looks pretty uncontroversial, but there's no point in going
further with this one until followup patches are closer to commit.
0003:
AmWalSummarizerProcess() is unused. Remove?
MaybeStartWalSummarizer() is called on each ServerLoop() pass in postmaster.c?
This causes a function call to be emitted every time through. That
looks odd. All other process starts have some triggering condition.
GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
and SummarizeWAL use while(1). Maybe settle on one style?
Still reading this one.
0004:
in PrepareForIncrementalBackup(), the logic that determines
earliest_wal_range_tli and latest_wal_range_tli looks pretty weird. I
think it works fine if there's a single timeline, but not otherwise.
Or maybe the trick is that it relies on timelines returned by
readTimeLineHistory being sorted backwards? If so, maybe add a comment
about that somewhere; I don't think other callers of readTimeLineHistory
make that assumption.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house
PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)
Hi Robert,
[..spotted the v9 patchset..]
so I've spent some time playing with patchset v8 (without the 6/6
testing patch related to wal_level=minimal); the exceptions that were
checked against patchset v9 are marked as such.
1. On compile time there were 2 warnings to shadowing variable (at
least with gcc version 10.2.1), but on v9 that is fixed:
blkreftable.c: In function ‘WriteBlockRefTable’:
blkreftable.c:520:24: warning: declaration of ‘brtentry’ shadows a
previous local [-Wshadow=compatible-local]
walsummarizer.c: In function ‘SummarizeWAL’:
walsummarizer.c:833:36: warning: declaration of ‘private_data’ shadows
a previous local [-Wshadow=compatible-local]
2. Usability thing: I hit the timeout hard: "This backup requires WAL
to be summarized up to 0/90000D8, but summarizer has only reached
0/0." with summarize_wal=off (default) but apparently this in TODO.
Looks like an important usability thing.
3. I've verified that if the DB was in wal_level=minimal even
temporarily (and thus summarization was disabled) it is impossible to
take an incremental backup:
pg_basebackup: error: could not initiate base backup: ERROR: WAL
summaries are required on timeline 1 from 0/70000D8 to 0/10000028, but
the summaries for that timeline and LSN range are incomplete
DETAIL: The first unsummarized LSN is this range is 0/D04AE88.
4. As we have discussed off list, there is (was) this
pg_combinebackup bug in v8's reconstruct_from_incremental_file() where
it was unable to realize that - in case of combining multiple
incremental backups - it should stop looking for the previous instance
of the full file (while it was fine with v6 of the patchset). I've
checked it on v9 - it is good now.
5. On v8 I've finally played a little bit with standby(s) and this
patchset with a couple of basic scenarios while mixing the source of
the backups:
a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
# sometimes i'm getting spurious error like those when doing
incrementals on standby with -c fast :
2023-11-15 13:49:05.721 CET [10573] LOG: recovery restart point
at 0/A000028
2023-11-15 13:49:07.591 CET [10597] WARNING: aborting backup due
to backend exiting before pg_backup_stop was called
2023-11-15 13:49:07.591 CET [10597] ERROR: manifest requires WAL
from final timeline 1 ending at 0/A0000F8, but this backup starts at
0/A000028
2023-11-15 13:49:07.591 CET [10597] STATEMENT: BASE_BACKUP (
INCREMENTAL, LABEL 'pg_basebackup base backup', PROGRESS,
CHECKPOINT 'fast', WAIT 0, MANIFEST 'yes', TARGET 'client')
# when you retry the same pg_basebackup it goes fine (looks like
CHECKPOINT on standby/restartpoint <-> summarizer disconnect, I'll dig
deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
against primary just before pg_basebackup --incr on standby
works around it)
b. full on primary, incr1 on standby, full db restore (incl. incr1) on
standby # WORKS
c. full on standby, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*
d. full on primary, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*
* - needs pg_promote() due to the control file having the standby bit
set, plus potential fiddling with postgresql.auto.conf as it contains
the primary_conninfo GUC.
6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having
more recent pg_basebackup (e.g. from pg18 one day) running against
pg17 with respect to this incremental backup capability. Is it going
to be safe? (Currently there seem to be no safeguards against such
use.) Or should those things (core, pg_basebackup) be required to run
in version lockstep?
Regards,
-J.
On 2023-Nov-13, Robert Haas wrote:
On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Also, it would be good to provide and use a
function to initialize a BlockRefTableKey from the RelFileNode and
forknum components, and ensure that any padding bytes are zeroed.
Otherwise it's not going to be a great hash key. On my platform there
aren't any (padding bytes), but I think it's unwise to rely on that.I'm having trouble understanding the second part of this suggestion.
Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and
in a backend context, we get the default, which is
MemoryContextAllocZero. Maybe there's some case this doesn't cover,
though?
I meant code like this
memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
key.forknum = forknum;
entry = blockreftable_lookup(brtab->hash, key);
where any padding bytes in "key" could have arbitrary values, because
they're not initialized. So I'd have a (maybe static inline) function
BlockRefTableKeyInit(&key, rlocator, forknum)
that fills it in for you.
Note:
#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
AFAICT the new simplehash uses in this patch series are the only ones
that use memcmp() as SH_EQUAL, so we don't necessarily have precedent on
lack of padding bytes initialization in existing uses of simplehash.
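For illustration, the suggested helper might look about like this (a
hypothetical sketch, not part of the posted patch; it zeroes the whole
struct first so the memcmp()-based SH_EQUAL only ever sees defined
padding bytes):

static inline void
BlockRefTableKeyInit(BlockRefTableKey *key,
                     const RelFileLocator *rlocator,
                     ForkNumber forknum)
{
    /* Zero everything, including padding, before filling in fields. */
    memset(key, 0, sizeof(BlockRefTableKey));
    memcpy(&key->rlocator, rlocator, sizeof(RelFileLocator));
    key->forknum = forknum;
}

Call sites would then shrink to BlockRefTableKeyInit(&key, rlocator,
forknum) followed by the lookup.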
These forward struct declarations are not buying you anything, I'd
remove them:I've had problems from time to time when I don't do this. I'll remove
it here, but I'm not convinced that it's always useless.
Well, certainly there are places where they are necessary.
I don't much like the way the header files in src/bin/pg_combinebackup
are structured. Particularly, causing a simplehash to be "instantiated"
just because load_manifest.h is included seems poised to cause pain. I
think there should be a file with the basic struct declarations (no
simplehash); and then maybe since both pg_basebackup and
pg_combinebackup seem to need the same simplehash, create a separate
header file containing just that. But, did you notice that anything
that includes reconstruct.h will instantiate the simplehash stuff,
because it includes load_manifest.h? It may be unwise to have the
simplehash in a header file. Maybe just declare it in each .c file that
needs it. The duplication is not that large.
Oh, I hadn't grokked that we had this SH_SCOPE thing and a separate
SH_DECLARE for it being extern. OK, please ignore that.
Why leave unnamed arguments in function declarations?
I mean, I've changed it now, but I don't think it's worth getting too
excited about.
Well, we did get into consistency arguments on this point previously. I
agree it's not *terribly* important, but on thread
/messages/by-id/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
people got really worked up about this stuff.
In GetFileBackupMethod(), which arguments are in and which are out?
The comment doesn't say, and it's not obvious why we pass both the file
path as well as the individual constituent pieces for it.The header comment does document which values are potentially set on
return. I guess I thought it was clear enough that the stuff not
documented to be output parameters was input parameters. Most of them
aren't even pointers, so they have to be input parameters. The only
exception is 'path', which I have some difficulty thinking that anyone
is going to imagine to be an input pointer.
An output pointer, you mean :-) (Should it be const?)
When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
to these output values; we modify some, but why? Maybe they should be
left alone? In that case, the "if size == 0" test should move a couple
of lines up, in the brtentry == NULL block.
BTW, you could do the qsort() after deciding to backup the file fully if
more than 90% needs to be replaced.
BTW, in sendDir() why do
lookup_path = pstrdup(pathbuf + basepathlen + 1);
when you could do
lookup_path = pstrdup(tarfilename);
?
There are two functions named record_manifest_details_for_file() in
different programs.I had trouble figuring out how to name this stuff. I did notice the
awkwardness, but surely nobody can think that two functions with the
same name in different binaries can be actually the same function.
Of course not, but when cscope-jumping around, it is weird.
If we want to inject more underscores here, my vote is to go all the
way and make it per_wal_range_cb.
+1
In walsummarizer.c, HandleWalSummarizerInterrupts is called in
summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
Maybe it should?I replaced all the CHECK_FOR_INTERRUPTS() in that file with
HandleWalSummarizerInterrupts(). Does that seem right?
Looks to be what walwriter.c does at least, so I guess it's OK.
I think this path is not going to be very human-friendly.
snprintf(final_path, MAXPGPATH,
XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
tli,
LSN_FORMAT_ARGS(summary_start_lsn),
LSN_FORMAT_ARGS(summary_end_lsn));
Why not add a dash between the TLI and between both LSNs, or something
like that?
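For illustration, a dash-separated variant could look like this
(hypothetical formatting, just to make the idea concrete):

snprintf(final_path, MAXPGPATH,
         XLOGDIR "/summaries/%08X-%08X%08X-%08X%08X.summary",
         tli,
         LSN_FORMAT_ARGS(summary_start_lsn),
         LSN_FORMAT_ARGS(summary_end_lsn));

which would yield names like
00000001-0000000001000028-000000000A0000F8.summary.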
But I have a hard time arguing that it wouldn't be more readable still
if we put some separator characters in there. I didn't do that because
then they'd look less like WAL file names, but maybe that's not really
a problem. A possible reason not to bother is that these files are
less necessary for humans to care about than WAL files, since they
don't need to be archived or transported between nodes in any way.
Basically I think this is probably fine the way it is, but if you or
others think it's really important to change it, I can do that. Just
as long as we don't spend 50 emails arguing about which separator
character to use.
Yeah, I just think that an endless stream of hex chars is hard to read,
and I've found myself following digits on the screen with my finger in
order to parse file names. I guess you could say thousands separators
for regular numbers aren't needed either, but we do have them for
readability sake.
I think a new section in chapter 30 "Reliability and the Write-Ahead
Log" is warranted. It would explain the summarization process, what the
summary files are used for, and the deletion mechanism. I can draft
something if you want.
It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
purpose or it's just a leftover from prior development. I see it's only
read in an assertion ... Maybe if we think this cross-check is
important, it should be turned into an elog? Otherwise, I'd remove it.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto. No es cierto, y si fuera cierto,
no me acuerdo." (Augusto Pinochet a una corte de justicia)
On Tue, Nov 14, 2023 at 8:12 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
0001 looks OK to push, and since it stands on its own I would get it out
of the way soon rather than waiting for the rest of the series to be
further reviewed.
All right, done.
0003:
AmWalSummarizerProcess() is unused. Remove?
The intent seems to be to have one of these per enum value, whether it
gets used or not. Some of the others aren't used, either.
MaybeStartWalSummarizer() is called on each ServerLoop() pass in postmaster.c?
This causes a function call to be emitted every time through. That
looks odd. All other process starts have some triggering condition.
I'm not sure how much this matters, really. I would expect that the
function call overhead here wouldn't be very noticeable. Generally I
think that when ServerLoop returns from WaitEventSetWait it's going to
be because we need to fork a process. That's pretty expensive compared
to a function call. If we can iterate through this loop lots of times
without doing any real work then it might matter, but I feel like
that's probably not the case, and probably something we would want to
fix if it were the case.
Now, I could nevertheless hoist some of the triggering conditions out of
MaybeStartWalSummarizer() to the call sites, but moving, say, just the
summarize_wal condition wouldn't be enough to avoid having MaybeStartWalSummarizer()
called repeatedly when there was no work to do, because summarize_wal
could be true and the summarizer could already be running. Similarly, if I
move just the WalSummarizerPID == 0 condition, the function gets
called repeatedly without doing anything when summarize_wal = off. So
at a minimum you have to move both of those if you care about avoiding
the function call overhead, and then you have to wonder if you care
about the corner cases where the function would be called repeatedly
for no gain even then.
Another approach would be to make the function static inline rather
than just static. Or we could delete the whole function and just
duplicate the logic it contains at both call sites. Personally I'm
inclined to just leave it how it is in the absence of some evidence
that there's a real problem here. It's nice to have all the triggering
conditions in one place with nothing duplicated.
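For concreteness, the gating logic under discussion amounts to
something like this sketch (the helper name for actually forking the
process is an assumption on my part):

static inline void
MaybeStartWalSummarizer(void)
{
	/* Both conditions must hold before forking the summarizer. */
	if (summarize_wal && WalSummarizerPID == 0)
		WalSummarizerPID = StartWalSummarizer();
}

Hoisting either condition alone to the call sites would still leave the
other to be re-tested on every pass through the loop, which is the
corner case described above.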
GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
and SummarizeWAL use while(1). Maybe settle on one style?
OK.
0004:
in PrepareForIncrementalBackup(), the logic that determines
earliest_wal_range_tli and latest_wal_range_tli looks pretty weird. I
think it works fine if there's a single timeline, but not otherwise.
Or maybe the trick is that it relies on timelines returned by
readTimeLineHistory being sorted backwards? If so, maybe add a comment
about that somewhere; I don't think other callers of readTimeLineHistory
make that assumption.
It does indeed rely on that assumption, and the comment at the top of
the for (i = 0; i < num_wal_ranges; ++i) loop explains that. Note also
the comment just below that begins "If we found this TLI in the
server's history". I agree with you that this logic looks strange, and
it's possible that there's some better way to encode the idea than
what I've done here, but I think it might be just that the particular
calculation we're trying to do here is strange. It's almost easier to
understand the logic if you start by reading the sanity checks
("manifest requires WAL from initial timeline %u starting at %X/%X,
but that timeline begins at %X/%X" et al.), look at the triggering
conditions for those, and then work upward to see how
earliest/latest_wal_range_tli get set, and then look up from there to
see how saw_earliest/latest_wal_range_tli are used in computing those
values.
We do rely on the ordering assumption elsewhere. For example, in
XLogFileReadAnyTLI, see if (tli < curFileTLI) break. We also use it to
set expectedTLEs, which is documented to have this property. And
AddWALInfoToBackupManifest relies on it too; see the comment "Because
the timeline history file lists newer timelines before older ones" in
that function. We're not entirely consistent about this,
e.g., unlike XLogFileReadAnyTLI, tliInHistory() and
tliOfPointInHistory() don't have an early exit provision, but we do
use it in some places.
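As an illustration, the newest-first ordering is what makes an early
exit possible when scanning the history; a sketch, where target_tli is
a hypothetical variable:

TimeLineHistoryEntry *match = NULL;
ListCell   *lc;

/* readTimeLineHistory() returns entries sorted newest-first. */
foreach(lc, timelineHistory)
{
	TimeLineHistoryEntry *tle = lfirst(lc);

	if (tle->tli < target_tli)
		break;				/* sorted descending, no match can follow */
	if (tle->tli == target_tli)
	{
		match = tle;
		break;
	}
}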
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I meant code like this
memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
key.forknum = forknum;
entry = blockreftable_lookup(brtab->hash, key);
Ah, I hadn't thought about that. Another way of handling that might be
to add = {0} to the declaration of key. But I can do the initializer
thing too if you think it's better. I'm not sure if there's an
argument that the initializer might optimize better.
An output pointer, you mean :-) (Should it be const?)
I'm bad at const, but that seems to work, so sure.
When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
to these output values; we modify some, but why? Maybe they should be
left alone? In that case, the "if size == 0" test should move a couple
of lines up, in the brtentry == NULL block.
OK.
BTW, you could do the qsort() after deciding to backup the file fully if
more than 90% needs to be replaced.
OK.
BTW, in sendDir() why do
lookup_path = pstrdup(pathbuf + basepathlen + 1);
when you could do
lookup_path = pstrdup(tarfilename);
?
No reason, changed.
If we want to inject more underscores here, my vote is to go all the
way and make it per_wal_range_cb.
+1
Will look into this.
Yeah, I just think that an endless stream of hex chars is hard to read,
and I've found myself following digits on the screen with my fingers in
order to parse file names. I guess you could say thousands separators
for regular numbers aren't needed either, but we do have them for
readability's sake.
Sigh.
I think a new section in chapter 30 "Reliability and the Write-Ahead
Log" is warranted. It would explain the summarization process, what the
summary files are used for, and the deletion mechanism. I can draft
something if you want.
Sure, if you want to take a crack at it, that's great.
It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
purpose or it's just a leftover from prior development. I see it's only
read in an assertion ... Maybe if we think this cross-check is
important, it should be turned into an elog? Otherwise, I'd remove it.
I've been thinking about that. One thing I'm not quite sure about
though is introspection. Maybe there should be a function that shows
summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
should expose pending_lsn too.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-Oct-04, Robert Haas wrote:
- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week. File removal is governed by mtimes, so if you change the
mtimes of your summary files or whack your system clock around, weird
things might happen. But obviously this might be inconvenient. Some
people might not want WAL summary files to be generated at all because
they don't care about incremental backup, and other people might want
them retained for longer, and still other people might want them to be
not removed automatically or removed automatically based on some
criteria other than mtime. I don't really know what's best here. I
don't think the default policy that the patches implement is
especially terrible, but it's just something that I made up and I
don't have any real confidence that it's wonderful. One point to be
considered here is that, if WAL summarization is enabled, checkpoints
can't remove WAL that isn't summarized yet. Mostly that's not a
problem, I think, because the WAL summarizer is pretty fast. But it
could increase disk consumption for some people. I don't think that we
need to worry about the summaries themselves being a problem in terms
of space consumption; at least in all the cases I've tested, they're
just not very big.
So, summarize_wal is no longer turned on by default, I think following a
comment from Peter E. I think this is a good decision, as we're only
going to need them on servers from which incremental backups are going
to be taken, which is a strict subset of all servers; and furthermore,
people that need them are going to realize that very easily, whereas if we
went the other way around, most people would not realize that they need to
turn them off to save some resource consumption.
Granted, the amount of resources additionally used is probably not very
big. But since it can be changed with a reload, not a restart, it doesn't
seem problematic.
... oh, I just noticed that this patch now fails to compile because of
the MemoryContextResetAndDeleteChildren removal.
(Typo in the pg_walsummary manpage: "since WAL summary files primary
exist" -> "primarily")
- On a related note, I haven't yet tested this on a standby, which is
a thing that I definitely need to do. I don't know of a reason why it
shouldn't be possible for all of this machinery to work on a standby
just as it does on a primary, but then we need the WAL summarizer to
run there too, which could end up being a waste if nobody ever tries
to take an incremental backup. I wonder how that should be reflected
in the configuration. We could do something like what we've done for
archive_mode, where on means "only on if this is a primary" and you
have to say always if you want it to run on standbys as well ... but
I'm not sure if that's a design pattern that we really want to
replicate into more places. I'd be somewhat inclined to just make
whatever configuration parameters we need to configure this thing on
the primary also work on standbys, and you can set each server up as
you please. But I'm open to other suggestions.
I think it should default to off in primary and standby, and the user
has to enable it in whichever server they want to take backups from.
- We need to settle the question of whether to send the whole backup
manifest to the server or just the LSN. In a previous attempt at
incremental backup, we decided the whole manifest was necessary,
because flat-copying files could make new data show up with old LSNs.
But that version of the patch set was trying to find modified blocks
by checking their LSNs individually, not by summarizing WAL. And since
the operations that flat-copy files are WAL-logged, the WAL summary
approach seems to eliminate that problem - maybe an LSN (and the
associated TLI) is good enough now. This also relates to Jakub's
question about whether this machinery could be used to fast-forward a
standby, which is not exactly a base backup but ... perhaps close
enough? I'm somewhat inclined to believe that we can simplify to an
LSN and TLI; however, if we do that, then we'll have big problems if
later we realize that we want the manifest for something after all. So
if anybody thinks that there's a reason to keep doing what the patch
does today -- namely, upload the whole manifest to the server --
please speak up.
I don't understand this point. Currently, the protocol is that
UPLOAD_MANIFEST is used to send the manifest prior to requesting the
backup. You seem to be saying that you're thinking of removing support
for UPLOAD_MANIFEST and instead just give the LSN as an option to the
BASE_BACKUP command?
- It's regrettable that we don't have incremental JSON parsing;
We now do have it, at least in patch form. I guess the question is
whether we're going to accept it in core. I see chances of changing the
format of the manifest rather slim at this point, and the need for very
large manifests is likely to go up with time, so we probably need to
take that code and polish it up, and see if we can improve its
performance.
- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?
I'd rather have a way for the server to provide diagnostics on why the
summaries aren't being produced. Maybe a server running under valgrind
is going to fail and need a longer one, but otherwise a hardcoded
timeout seems sufficient.
You did say later that you thought summary files would just go from one
checkpoint to the next. So the only question is at what point the file
for the last checkpoint (i.e. from the previous one up to the one
requested by pg_basebackup) is written. If walsummarizer keeps almost
the complete state in memory and just waits for the checkpoint record to
write it, then it's probably okay.
- I'm curious what people think about the pg_walsummary tool that is
included in 0006. I think it's going to be fairly important for
debugging, but it does feel a little bit bad to add a new binary for
something pretty niche. Nevertheless, merging it into any other
utility seems relatively awkward, so I'm inclined to think both that
this should be included in whatever finally gets committed and that it
should be a separate binary. I considered whether it should go in
contrib, but we seem to have moved to a policy that heavily favors
limiting contrib to extensions and loadable modules, rather than
binaries.
I propose to keep the door open for that binary doing other things than
dumping the files as text. So add a command argument, which currently
can only be "dump", to allow the command to do other things later if
needed. (For example, remove files from a server on which summarize_wal
has been turned off; or perhaps remove files that are below some LSN.)
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Estoy de acuerdo contigo en que la verdad absoluta no existe...
El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)
On 2023-Nov-16, Robert Haas wrote:
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I meant code like this
memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
key.forknum = forknum;
entry = blockreftable_lookup(brtab->hash, key);
Ah, I hadn't thought about that. Another way of handling that might be
to add = {0} to the declaration of key. But I can do the initializer
thing too if you think it's better. I'm not sure if there's an
argument that the initializer might optimize better.
I think the {0} initializer is good enough, given a comment to indicate
why.
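That is, the agreed-upon form would be something along these lines:

/*
 * Zero the whole key, including any alignment padding, so that hashing
 * the struct as raw bytes gives stable results.
 */
BlockRefTableKey key = {0};

memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
key.forknum = forknum;
entry = blockreftable_lookup(brtab->hash, key);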
It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
purpose or it's just a leftover from prior development. I see it's only
read in an assertion ... Maybe if we think this cross-check is
important, it should be turned into an elog? Otherwise, I'd remove it.
I've been thinking about that. One thing I'm not quite sure about
though is introspection. Maybe there should be a function that shows
summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
should expose pending_lsn too.
True.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On 2023-Nov-16, Alvaro Herrera wrote:
On 2023-Oct-04, Robert Haas wrote:
- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?
I'd rather have a way for the server to provide diagnostics on why the
summaries aren't being produced. Maybe a server running under valgrind
is going to fail and need a longer one, but otherwise a hardcoded
timeout seems sufficient.
You did say later that you thought summary files would just go from one
checkpoint to the next. So the only question is at what point the file
for the last checkpoint (i.e. from the previous one up to the one
requested by pg_basebackup) is written. If walsummarizer keeps almost
the complete state in memory and just waits for the checkpoint record to
write it, then it's probably okay.
On 2023-Nov-16, Alvaro Herrera wrote:
On 2023-Nov-16, Robert Haas wrote:
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
purpose or it's just a leftover from prior development. I see it's only
read in an assertion ... Maybe if we think this cross-check is
important, it should be turned into an elog? Otherwise, I'd remove it.
I've been thinking about that. One thing I'm not quite sure about
though is introspection. Maybe there should be a function that shows
summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
should expose pending_lsn too.
True.
Putting those two thoughts together, I think pg_basebackup with
--progress could tell you "still waiting for the summary file up to LSN
%X/%X to appear, and the walsummarizer is currently handling lsn %X/%X"
or something like that. This would probably require two concurrent
connections, one to run BASE_BACKUP and another to inquire server state;
but this should be easy enough to integrate with parallel
basebackup later.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On Thu, Nov 16, 2023 at 12:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Putting those two thoughts together, I think pg_basebackup with
--progress could tell you "still waiting for the summary file up to LSN
%X/%X to appear, and the walsummarizer is currently handling lsn %X/%X"
or something like that. This would probably require two concurrent
connections, one to run BASE_BACKUP and another to inquire server state;
but this should be easy enough to integrate with parallel
basebackup later.
I had similar thoughts, except I was thinking it would be better to
have the warnings be generated on the server side. That would save the
need for a second libpq connection, which would be good, because I
think adding that would result in a pretty large increase in
complexity and some not-so-great user-visible consequences. In fact,
my latest thought is to just remove the timeout altogether, and emit
warnings like this:
WARNING: still waiting for WAL summarization to reach %X/%X after %d
seconds, currently at %X/%X
We could emit that every 30 seconds or so until either the situation
resolves itself or the user hits ^C. I think that would be good enough
here. If we want, the interval between messages can be a GUC, but I
don't know how much real need there will be to tailor that.
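A sketch of that server-side loop, where GetSummarizedLSN() is a
hypothetical accessor and the latch-wait details are elided:

int			elapsed = 0;

for (;;)
{
	XLogRecPtr	summarized = GetSummarizedLSN();

	if (summarized >= backup_start_lsn)
		break;				/* summaries have caught up */

	CHECK_FOR_INTERRUPTS();	/* let the user cancel with ^C */
	pg_usleep(30 * USECS_PER_SEC);	/* or a latch wait with a timeout */
	elapsed += 30;
	ereport(WARNING,
			(errmsg("still waiting for WAL summarization to reach %X/%X "
					"after %d seconds, currently at %X/%X",
					LSN_FORMAT_ARGS(backup_start_lsn), elapsed,
					LSN_FORMAT_ARGS(summarized))));
}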
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
So, summarize_wal is no longer turned on by default, I think following a
comment from Peter E. I think this is a good decision, as we're only
going to need them on servers from which incremental backups are going
to be taken, which is a strict subset of all servers; and furthermore,
people that need them are going to realize that very easily, whereas if we
went the other way around, most people would not realize that they need to
turn them off to save some resource consumption.
Granted, the amount of resources additionally used is probably not very
big. But since it can be changed with a reload, not a restart, it doesn't
seem problematic.
Yeah. I meant to say that I'd changed that for that reason, but in the
flurry of new versions I omitted to do so.
... oh, I just noticed that this patch now fails to compile because of
the MemoryContextResetAndDeleteChildren removal.
Fixed.
(Typo in the pg_walsummary manpage: "since WAL summary files primary
exist" -> "primarily")
This, too.
I think it should default to off in primary and standby, and the user
has to enable it in whichever server they want to take backups from.
Yeah, that's how it works currently.
I don't understand this point. Currently, the protocol is that
UPLOAD_MANIFEST is used to send the manifest prior to requesting the
backup. You seem to be saying that you're thinking of removing support
for UPLOAD_MANIFEST and instead just give the LSN as an option to the
BASE_BACKUP command?
I don't think I'd want to do exactly that, because then you could only
send one LSN, and I do think we want to send a set of LSN ranges with
the corresponding TLI for each. I was thinking about dumping
UPLOAD_MANIFEST and instead having a command like:
INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698
The client would execute this command one or more times before
starting an incremental backup.
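So a backup whose manifest happens to span a timeline switch might,
hypothetically, issue something like:

INCREMENTAL_WAL_RANGE 1 2/462AC48 2/4800000
INCREMENTAL_WAL_RANGE 2 2/4800000 2/4A01258
BASE_BACKUP ...

one command per (TLI, start LSN, end LSN) triple taken from the
manifest's WAL ranges.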
I propose to keep the door open for that binary doing other things than
dumping the files as text. So add a command argument, which currently
can only be "dump", to allow the command to do other things later if
needed. (For example, remove files from a server on which summarize_wal
has been turned off; or perhaps remove files that are below some LSN.)
I don't like that very much. That sounds like one of those
forward-compatibility things that somebody designs and then nothing
ever happens and ten years later you still have an ugly wart.
My theory is that these files are going to need very little
management. In general, they're small; if you never removed them, it
probably wouldn't hurt, or at least, not for a long time. As to
specific use cases, if you want to remove files from a server on which
summarize_wal has been turned off, you can just use rm. Removing files
from before a certain LSN would probably need a bit of scripting, but
only a bit. Conceivably we could provide something like that in core,
but it doesn't seem necessary, and it also seems to me that we might
do well to include that in pg_archivecleanup rather than in
pg_walsummary.
Here's a new version. Changes:
- Add preparatory renaming patches to the series.
- Rename wal_summarize_keep_time to wal_summary_keep_time.
- Change while (true) to while (1).
- Typo fixes.
- Fix incorrect assertion in summarizer_read_local_xlog_page; this
could cause occasional regression test failures in 004_pg_xlog_symlink
and 009_growing_files.
- Zero-initialize BlockRefTableKey variables.
- Replace a couple instances of pathbuf + basepathlen + 1 with tarfilename.
- Add const to path argument of GetFileBackupMethod.
- Avoid setting output parameters of GetFileBackupMethod unless the
return value is BACK_UP_FILE_INCREMENTALLY.
- In GetFileBackupMethod, postpone qsorting block numbers slightly.
- Define INCREMENTAL_PREFIX_LENGTH using sizeof(), because that should
hopefully work everywhere; the StaticAssertStmt that previously checked
the value doesn't work on Windows. (See the sketch below.)
- Change MemoryContextResetAndDeleteChildren to MemoryContextReset.
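For reference, the sizeof() trick mentioned above amounts to the
following; the macro names reflect my reading of the patch:

#define INCREMENTAL_PREFIX			"INCREMENTAL."
#define INCREMENTAL_PREFIX_LENGTH	(sizeof(INCREMENTAL_PREFIX) - 1)

sizeof() applied to a string literal is a compile-time constant on
every compiler, which sidesteps the Windows problem with the
StaticAssertStmt.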
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v10-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
From 5329672f6ccd92cb3b60e3c64f699327d099bc88 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v10 3/7] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v10-0006-Add-new-pg_walsummary-tool.patch
From 24d51d4309dee0defeafd2cb57cee7fa4148dd19 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v10 6/7] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 477 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ destroyed within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index bf6ed98703..ad8a1166d1 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -338,7 +338,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1fa5f0ed26..0565e160f2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4026,3 +4026,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.37.1 (Apple Git-137.1)
v10-0004-Add-a-new-WAL-summarizer-process.patch
From c8e46c57bf89d051f64e4ac3666810c83ce74705 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v10 4/7] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1383 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3726 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc35a46e5e..6073b93480 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4134,6 +4134,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1159dff1a6..678495a64b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3574,6 +3575,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3853,8 +3891,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3897,6 +3935,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5221,9 +5279,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6940,6 +6998,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7614,6 +7691,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
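For anyone poking around pg_wal/summaries by hand, the decoding above is
easy to replicate outside the server. A standalone sketch (the file name
here is invented) showing how a summary name maps to a TLI and LSN range:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* hypothetical name: TLI 1, start 0/1000028, end 0/1FFFFD8 */
	const char *name = "0000000100000000010000280000000001FFFFD8";
	unsigned tmp[5];
	uint64_t start_lsn, end_lsn;

	sscanf(name, "%08X%08X%08X%08X%08X",
		   &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
	start_lsn = ((uint64_t) tmp[1]) << 32 | tmp[2];
	end_lsn = ((uint64_t) tmp[3]) << 32 | tmp[4];
	printf("TLI %u, start %X/%X, end %X/%X\n", tmp[0],
		   (unsigned) (start_lsn >> 32), (unsigned) start_lsn,
		   (unsigned) (end_lsn >> 32), (unsigned) end_lsn);
	return 0;
}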
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
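The coverage check above is just a sweep over intervals sorted by start
point. In case the idea is easier to see without the PostgreSQL list
machinery, here's a self-contained sketch of the same algorithm over plain
arrays, with made-up ranges that leave a gap:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { uint64_t start, end; } Range;

static int cmp_start(const void *a, const void *b)
{
	const Range *r1 = a, *r2 = b;
	return (r1->start > r2->start) - (r1->start < r2->start);
}

static bool ranges_cover(Range *r, int n, uint64_t start, uint64_t end,
						 uint64_t *missing)
{
	uint64_t current = start;

	qsort(r, n, sizeof(Range), cmp_start);
	for (int i = 0; i < n; i++)
	{
		if (r[i].start > current)
			break;				/* found a gap */
		if (r[i].end > current)
		{
			current = r[i].end;	/* extend the known-covered range */
			if (current >= end)
				return true;
		}
	}
	*missing = current;
	return false;
}

int main(void)
{
	Range r[] = {{0, 100}, {150, 300}, {100, 140}};	/* gap at 140..150 */
	uint64_t missing;

	if (!ranges_cover(r, 3, 0, 300, &missing))
		printf("first uncovered point: %llu\n", (unsigned long long) missing);
	return 0;
}

This prints 140, which is what WalSummariesAreComplete would report via
*missing_lsn for the analogous input.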
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
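Note that a short write is treated as an error here rather than retried;
on a regular file that should essentially only happen on ENOSPC, hence the
hint. For contrast, the familiar retry idiom, which this code intentionally
does not need, looks roughly like this on a bare POSIX descriptor:

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int
write_all(int fd, const char *buf, size_t len)
{
	while (len > 0)
	{
		ssize_t n = write(fd, buf, len);

		if (n < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted, just retry */
			return -1;			/* real error */
		}
		buf += n;
		len -= (size_t) n;
	}
	return 0;
}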
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
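The loop above leans on appendStringInfoVA returning the required space
when the buffer was too small, so that the next pass is guaranteed to fit.
The same grow-and-retry pattern with bare vsnprintf, as a standalone sketch
(allocation failures ignored for brevity):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* printf into a freshly malloc'd buffer, growing until the output fits */
static char *
format_alloc(const char *fmt, ...)
{
	size_t size = 64;
	char *buf = malloc(size);

	for (;;)
	{
		va_list ap;
		int needed;

		va_start(ap, fmt);
		needed = vsnprintf(buf, size, fmt, ap);
		va_end(ap);

		if (needed >= 0 && (size_t) needed < size)
			return buf;			/* it fit */
		size = (needed >= 0) ? (size_t) needed + 1 : size * 2;
		buf = realloc(buf, size);	/* too small: grow and retry */
	}
}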
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7b6b613c4a..7952fd5c4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -931,6 +935,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1833,6 +1840,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2667,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3022,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3141,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3550,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3706,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3734,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3832,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4054,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5403,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5543,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..bf6ed98703
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1383 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the last LSN and TLI at
+ * which the next summary file will start. Normally, these are the LSN and
+ * TLI at which the last file ended; in such case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substntially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60;	/* ten days, in minutes */
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
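The found flag here is the usual create-or-attach idiom: exactly one
process observes !found and fills in the initial values. As a loose analogy
only (this is not how PostgreSQL's shared memory allocator works, and it
ignores the locking that makes the real thing safe), the same idea over
POSIX shared memory:

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create the named segment if it doesn't exist yet, else attach to it. */
static void *
create_or_attach(const char *name, size_t size, bool *found)
{
	int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);

	if (fd >= 0)
	{
		*found = false;			/* we created it, so we must initialize it */
		if (ftruncate(fd, (off_t) size) < 0)
			return NULL;
	}
	else if (errno == EEXIST)
	{
		*found = true;			/* someone else created it already */
		fd = shm_open(name, O_RDWR, 0600);
		if (fd < 0)
			return NULL;
	}
	else
		return NULL;
	return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}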
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the requested information to the caller. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
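The dance at the top of this function, where we take the lock in shared
mode, fall back to exclusive mode only if initialization is needed, and are
prepared to find that someone else initialized the structure in between, is
a generic pattern. A standalone sketch of the same idea using a pthreads
rwlock, with an invented compute_value() standing in for the real
initialization work:

#include <pthread.h>
#include <stdbool.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static bool initialized = false;
static int cached_value;

static int
compute_value(void)
{
	return 42;					/* stand-in for expensive one-time setup */
}

static int
get_value(void)
{
	bool exclusive = false;

	for (;;)
	{
		if (exclusive)
			pthread_rwlock_wrlock(&lock);
		else
			pthread_rwlock_rdlock(&lock);

		if (initialized)
		{
			int v = cached_value;	/* fast path: already initialized */

			pthread_rwlock_unlock(&lock);
			return v;
		}
		if (exclusive)
			break;				/* we hold the write lock; initialize below */
		pthread_rwlock_unlock(&lock);
		exclusive = true;		/* retry with the write lock */
	}

	cached_value = compute_value();
	initialized = true;
	pthread_rwlock_unlock(&lock);
	return cached_value;
}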
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
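A hypothetical caller (the variable names here are invented) would use this
roughly as follows, deriving success or failure from the returned LSN:

	XLogRecPtr	reached;

	/* wait up to 60 seconds for summarization to reach target_lsn */
	reached = WaitForWalSummarization(target_lsn, 60000);
	if (reached < target_lsn)
		ereport(ERROR,
				errmsg("WAL summarization did not catch up to %X/%X",
					   LSN_FORMAT_ARGS(target_lsn)));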
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
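The write-to-a-temp-file-then-durable_rename sequence above is what gives
us an all-or-nothing summary file even if we crash partway through. In bare
POSIX terms, the rename step amounts to roughly the following sketch (error
paths simplified; the real durable_rename in fd.c does more):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace 'final' with 'temp' in a crash-safe way. */
static int
durable_replace(const char *temp, const char *final, const char *dir)
{
	int fd = open(temp, O_RDWR);

	if (fd < 0 || fsync(fd) < 0)	/* the file's data must hit disk first */
		return -1;
	close(fd);

	if (rename(temp, final) < 0)	/* atomic swap of the two names */
		return -1;

	fd = open(dir, O_RDONLY);		/* persist the new directory entry */
	if (fd < 0 || fsync(fd) < 0)
		return -1;
	close(fd);
	return 0;
}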
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a partial
+ * record and flushed it to disk, and we'd view that data
+ * as safe to read. However, the XLOG_END_OF_RECOVERY
+ * record will be written at the end of the last complete
+ * WAL record, not at the end of the WAL that we've flushed
+ * to disk.
+ *
+ * So switchpoint < private_data->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that more WAL is likely to be available when we
+ * wake up.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
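+
+/*
+ * Illustration (numbers hypothetical; MS_PER_SLEEP_QUANTUM and
+ * MAX_SLEEP_QUANTA are defined earlier in this file): with a 200 ms
+ * quantum and a cap of 128 quanta, an idle summarizer sleeps 200 ms,
+ * 400 ms, 800 ms, and so on up to roughly 25 s, while a burst of page
+ * reads walks sleep_quanta back down toward 1.
+ */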
+
+/*
+ * Remove WAL summary files whose last modification time precedes the
+ * wal_summary_keep_time cutoff. Thanks to the redo-pointer check below,
+ * this does real work at most once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
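+
+ /*
+ * For example, with the default wal_summary_keep_time of 10 days (14400
+ * minutes), only files whose mtime is more than 864000 seconds in the
+ * past are candidates for removal.
+ */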
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the WAL described by this summary file no longer exists, we
+ * can remove the summary file, provided that its modification time
+ * is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cfc5afaa6f..ef2a3a2bfd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b764ef6998..a6de5aca0a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3191,6 +3204,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e48c066a5b..e732453daa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..d952dee912
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks for which a block
+ * reference has appeared in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
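+
+/*
+ * Worked example (illustrative only): block 70000 falls in chunk
+ * 70000 / BLOCKS_PER_CHUNK = 1 and is recorded there as the offset
+ * 70000 % BLOCKS_PER_CHUNK = 4464. An array-format chunk holds at most
+ * MAX_ENTRIES_PER_CHUNK = 4096 two-byte offsets, i.e. 8kB; a bitmap
+ * chunk uses the same 8kB as one bit for each of the 65536 blocks.
+ */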
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status; /* hash table entry status, managed by simplehash */
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
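+
+/*
+ * Overall serialized file layout, as produced by the code below:
+ *
+ *     uint32 magic number (BLOCKREFTABLE_MAGIC)
+ *     for each relation fork, in sorted order:
+ *         BlockRefTableSerializedEntry
+ *         uint16 chunk_usage[nchunks] (trailing zero entries trimmed)
+ *         data for each chunk whose chunk_usage is nonzero
+ *     all-zeroes BlockRefTableSerializedEntry, serving as a sentinel
+ *     pg_crc32c covering everything except the CRC itself
+ */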
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
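+
+/*
+ * Usage sketch (illustrative, not part of the patch): a WAL scanner
+ * might maintain a table like this, where 'rlocator' identifies the
+ * relation touched by the record just decoded:
+ *
+ *     BlockRefTable *brtab = CreateEmptyBlockRefTable();
+ *
+ *     BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, blkno);
+ *     BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, nblocks);
+ */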
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
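+
+/*
+ * Example (illustrative): with 'nblocks' the relation's current length
+ * in blocks, a single call with enough output space retrieves every
+ * modified block below that point. Note that the result is not
+ * necessarily sorted, because array-format chunks keep offsets in
+ * insertion order.
+ *
+ *     BlockNumber *blocks = palloc(sizeof(BlockNumber) * nblocks);
+ *     int n;
+ *
+ *     n = BlockRefTableEntryGetBlocks(entry, 0, nblocks, blocks, nblocks);
+ */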
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
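+
+/*
+ * Example write callback (illustrative, frontend-flavored; 'my_write' is
+ * hypothetical): per the contract in blkreftable.h, a write callback
+ * must complete the whole write or report an error and not return.
+ *
+ *     static int
+ *     my_write(void *callback_arg, void *data, int length)
+ *     {
+ *         FILE *f = (FILE *) callback_arg;
+ *
+ *         if (fwrite(data, 1, length, f) != (size_t) length)
+ *             pg_fatal("could not write block reference table: %m");
+ *         return length;
+ *     }
+ */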
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
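+
+/*
+ * Typical read loop (sketch): per the contract above, GetBlocks must be
+ * drained to zero before moving on to the next relation.
+ *
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator,
+ *                                            &forknum, &limit_block))
+ *     {
+ *         BlockNumber blocks[256];
+ *         unsigned n;
+ *
+ *         while ((n = BlockRefTableReaderGetBlocks(reader, blocks, 256)) > 0)
+ *             ... process n block numbers ...
+ *     }
+ */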
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
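+
+/*
+ * Incremental write sketch (illustrative; 'my_write' and 'state' are
+ * hypothetical): entries must be supplied already sorted by tablespace,
+ * database, relfilenumber, and fork number.
+ *
+ *     BlockRefTableWriter *writer;
+ *
+ *     writer = CreateBlockRefTableWriter(my_write, state);
+ *     (for each relation fork, in sorted order)
+ *         BlockRefTableWriteEntry(writer, entry);
+ *     DestroyBlockRefTableWriter(writer);
+ *
+ * The final call writes the sentinel entry and the CRC.
+ */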
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
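+
+/*
+ * Worked example (illustrative): truncating a relation to 70000 blocks
+ * gives limit_chunkno = 1 and limit_chunkoffset = 4464; chunks 2 and
+ * above are emptied outright, and within chunk 1 any recorded block at
+ * offset 4464 or beyond is discarded.
+ */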
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
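+
+/*
+ * To recap the growth strategy above with concrete numbers: a chunk
+ * starts as a 16-entry array and doubles as needed (16, 32, ..., 4096
+ * entries); once 4095 distinct offsets are stored, the next insertion
+ * converts it to an 8kB bitmap, after which its representation never
+ * changes again.
+ */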
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk directory structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..79c8f86d89 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12100,4 +12100,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * database, then tablespace, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
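
As a quick sanity check on the API above (in particular the io_callback_fn
contract and the limit-block rules), here is a minimal sketch of how a
backend-side caller might build and serialize an in-memory table. The
callback, the file descriptor plumbing, and all of the OIDs are hypothetical,
and error handling is reduced to elog():

#include "postgres.h"

#include <unistd.h>

#include "catalog/pg_tablespace_d.h"
#include "common/blkreftable.h"
#include "common/relpath.h"

/*
 * Hypothetical write callback. Per the contract above, short writes are
 * retried, so that the return value always equals the request length.
 */
static int
write_to_fd(void *callback_arg, void *data, int length)
{
	int			fd = *(int *) callback_arg;
	int			done = 0;

	while (done < length)
	{
		int			rc = write(fd, (char *) data + done, length - done);

		if (rc < 0)
			elog(ERROR, "could not write block reference table: %m");
		done += rc;
	}
	return done;
}

static void
serialize_example(int fd)
{
	BlockRefTable *brtab = CreateEmptyBlockRefTable();
	RelFileLocator rlocator;

	rlocator.spcOid = DEFAULTTABLESPACE_OID;
	rlocator.dbOid = 5;			/* made-up database OID */
	rlocator.relNumber = 16384;	/* made-up relfilenumber */

	/* The relation's main fork was truncated to 100 blocks ... */
	BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 100);

	/* ... and block 7 was modified by a later WAL record. */
	BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 7);

	/* Serialize everything through the callback. */
	WriteBlockRefTable(brtab, write_to_fd, &fd);
}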
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..4a6792e5f9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
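
And a sketch of how I'd expect a consumer such as the backup code to use
this interface. The variable names and the 60-second timeout are invented,
and I'm assuming the timeout argument is in milliseconds:

	XLogRecPtr	summarized_lsn;

	/* Wait up to 60 seconds for summarization to reach the backup start. */
	summarized_lsn = WaitForWalSummarization(backup_start_lsn, 60000);
	if (summarized_lsn < backup_start_lsn)
		ereport(ERROR,
				(errmsg("WAL summarization is not progressing")));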
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3cea73e220..7a2807a9a3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4003,3 +4003,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.37.1 (Apple Git-137.1)
Attachment: v10-0007-Test-patch-Enable-summarize_wal-by-default.patch
From 4afa56e6ec82bc812fa25c8993f373e7f9bf662a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v10 7/7] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal=on and wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7952fd5c4b..a804d07ce5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -935,9 +935,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index ad8a1166d1..ca3e504f47 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a6de5aca0a..170f491d7a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.37.1 (Apple Git-137.1)
Attachment: v10-0005-Add-support-for-incremental-backup.patch
From 564c4e822c350b7d17406746a56464e388212060 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v10 5/7] Add support for incremental backup.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Should we send the whole backup manifest to the server or, say,
just an LSN?
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 313 +++-
src/backend/backup/basebackup_incremental.c | 913 ++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1267 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 682 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
47 files changed, 5665 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest of an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files that contain
+ only the blocks that have changed since the earlier backup, plus enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and just take full backups, which are simpler
+ to manage. For a large database that is heavily modified throughout,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6073b93480..d5083afc87 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4137,13 +4137,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove the one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
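
To make the chain handling concrete, here's a slightly fuller usage example
than the one at the top of this email: a full backup plus two incrementals,
with a tablespace relocated during reconstruction. All paths are invented:

pg_combinebackup /backups/full /backups/incr1 /backups/incr2 \
    -T /mnt/old_tblspc=/mnt/new_tblspc \
    -o /restore/pgdata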
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
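
In other words, the backup_label of an incremental backup picks up two extra
lines, along these lines (values invented):

INCREMENTAL FROM LSN: 0/5000028
INCREMENTAL FROM TLI: 1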
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..9ecce5f222 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1129,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1163,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1183,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1192,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1234,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1386,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1461,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1536,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
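
So, to make the naming concrete: a relation segment of which only a few
blocks have changed is renamed in the backup stream, with its size
recomputed via GetIncrementalFileSize(). For example (OIDs invented):

full backup:        base/5/16384
incremental backup: base/5/INCREMENTAL.16384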
@@ -1446,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
+
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
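
To spell out the on-disk format that the sendFile() changes above produce:
each INCREMENTAL.* file begins with a header, followed by the raw page
contents. Roughly this, though it's a description rather than a real struct,
since the array is variable-length, and it assumes "unsigned" is 32 bits:

	uint32		magic;				/* INCREMENTAL_MAGIC */
	uint32		num_incremental_blocks;		/* how many blocks follow */
	uint32		truncation_block_length;	/* see blkreftable.h */
	BlockNumber	blocks[num_incremental_blocks];	/* segment-relative */
	/* ... then num_incremental_blocks * BLCKSZ bytes of page data ... */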
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..2e051f297c
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,913 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
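+ *
+ * For example (hypothetical OIDs), dboid 5, spcoid 1663, relfilenumber 16384,
+ * main fork, segno 2 yields "base/5/INCREMENTAL.16384.2".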
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
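+ /* Split the path into the containing directory and the file name. */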
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE
+ * entries.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
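+
+ /*
+ * The prior backup might itself have been incremental, in which case
+ * its manifest would list this file under its incremental name rather
+ * than under the ordinary relation file name.
+ */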
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose them to relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
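+ *
+ * For example, with the default BLCKSZ of 8192, an incremental file holding
+ * 10 blocks would occupy 3 * 4 + 10 * (4 + 8192) = 81972 bytes.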
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
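+ /*
+ * appendStringInfoVA() returns 0 on success; otherwise it returns the
+ * amount of additional space needed, so enlarge the buffer and retry
+ * until the formatted message fits.
+ */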
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
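+ *
+ * We compare explicitly rather than returning (aa - bb): BlockNumber is an
+ * unsigned 32-bit type, so the subtraction could wrap instead of producing
+ * a negative result.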
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in CopyIn mode as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..26fd9ad0bc 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
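+ /*
+ * Bits in "found": 1 = start LSN, 2 = start TLI, 4 = previous LSN,
+ * 8 = previous TLI.
+ */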
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and if *sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
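+ /* Temporarily NUL-terminate at e so sscanf cannot run past the line. */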
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..f2b45787e9
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..2b4e2dadff
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
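+ *
+ * For example, under this assumption a 10MB manifest is estimated to
+ * describe roughly 100,000 files, which is then used (clamped to the range
+ * [256, PG_UINT32_MAX]) as the initial hash table size.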
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..d52cc40b8b
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1267 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
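+
+/*
+ * For example, an incremental backup stores a partial copy of a relation
+ * file such as "base/5/16384" under the name "base/5/INCREMENTAL.16384";
+ * the reconstructed output file is written under the original name, with
+ * the prefix stripped.
+ */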
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ {
+ /* Without the final backup's manifest we have no WAL ranges to write. */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+ }
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
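+ *
+ * For example, -T /srv/old_ts=/srv/new_ts relocates one tablespace. An "="
+ * that is part of a directory name can be written as "\=", which is copied
+ * literally rather than treated as the separator.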
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
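+ *
+ * For example, given directories F I1 I2 (oldest first) on the command
+ * line, I2's backup_label must record I1's start TLI and LSN as its
+ * "previous" values, I1's must record F's, and F itself, being a full
+ * backup, must record no previous values at all.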
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
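+ *
+ * For example, "16384" parses to OID 16384, while "0", "16384x", and the
+ * empty string are all rejected.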
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
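+
+ /*
+ * For example, a file in subdirectory "base/1" of the main data directory
+ * gets manifest_prefix "base/1/", while a file at the top level of
+ * tablespace 16385 gets "pg_tblspc/16385/".
+ */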
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ closedir(dir);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..d3d089527e
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,682 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
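+ *
+ * For example, if block 3 of the output file is stored as the second block
+ * (i == 1) of some incremental file, then sourcemap[3] points at that
+ * file's rfile and offsetmap[3] is its header_length + 1 * BLCKSZ, matching
+ * the layout used below.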
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
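+
+	/*
+	 * For example, if truncation_block_length is 4 and the newest
+	 * incremental file contains blocks 1 and 5, the output file will be 6
+	 * blocks long: blocks 1 and 5 come from that file, blocks 0, 2, and 3
+	 * must still be found in an older backup, and block 4, which is past
+	 * the truncation length and present in no source, will be zero-filled.
+	 */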
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+			 * blocks that result from the server extending the file without
+			 * taking any action on those blocks that would have generated WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+			 * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
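+			 *
+			 * For instance, with the default 8kB block size, a
+			 * truncation_block_length of 100 permits a full copy only when
+			 * the source file is exactly 819200 bytes long.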
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
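+ *
+ * For example, if truncation_block_length is 4 and the incremental file
+ * contains blocks 2 and 7, the reconstructed length is 8 blocks.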
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+		{
+			pg_free(rf->filename);
+			pg_free(rf);
+			return NULL;
+		}
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+	int			rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
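+		/*
+		 * Each entry is "block:source@offset", or "first-last:source@offset"
+		 * for a run of consecutive blocks drawn from the same source;
+		 * "zero" marks blocks that will be zero-filled.
+		 */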
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+				if (current_block == start_of_range)
+					appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+									 current_block, s->filename,
+									 (uint64) offsetmap[current_block]);
+				else
+					appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+									 start_of_range, current_block,
+									 s->filename,
+									 (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+		int			wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+			/*
+			 * This block is not available from any source. It should be a
+			 * new, uninitialized block, so just zero-fill it.
+			 */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+			int			rb;
+
+			/* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();")
+  or die "Timed out while waiting for apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();")
+  or die "Timed out while waiting for apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+	'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+	'-d', $pitr2->connstr('postgres'),
+	],
+	'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+	"Checksum-Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+	[ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+	qr/is an incremental backup, but the first backup should be a full backup/,
+	"can't combine two incremental backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
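+ *
+ * Roughly, the finished manifest has this shape (checksums, sizes, and
+ * timestamps here are abbreviated, illustrative values; see
+ * add_file_to_manifest and finalize_manifest for the exact field set):
+ *
+ * { "PostgreSQL-Backup-Manifest-Version": 1,
+ * "Files": [
+ * { "Path": "backup_label", "Size": 227, "Last-Modified": "...",
+ * "Checksum-Algorithm": "CRC32C", "Checksum": "..." } ],
+ * "WAL-Ranges": [
+ * { "Timeline": 1, "Start-LSN": "0/2000028", "End-LSN": "0/2000100" } ],
+ * "Manifest-Checksum": "..."}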
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
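+ *
+ * For example, the input string say "hi" becomes "say \"hi\"" in the
+ * output buffer.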
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+	if (mwriter->fd == -1 &&
+		(mwriter->fd = open(mwriter->pathname,
+							O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+							pg_file_create_mode)) < 0)
+		pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+	if (mwriter->buf.len > 0)
+	{
+		ssize_t		wb;
+
+		wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+		if (wb != mwriter->buf.len)
+		{
+			if (wb < 0)
+				pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+			else
+				pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+						 mwriter->pathname, (int) wb, mwriter->buf.len);
+		}
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
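+ *
+ * For example, the two bytes 0xd3 and 0xae are encoded as "d3ae".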
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
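+ *
+ * WAL summary file names consist of 40 hexadecimal characters followed by
+ * ".summary"; anything else found in the directory is left alone.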
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+		pg_fatal("could not close directory \"%s\": %m", WALSUMMARYDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..72b4ecaf12 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
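+For example, the following call combines 'backup1' and 'backup2' into the
+new node's data directory before starting it:
+
+	$node->init_from_backup($primary, 'backup2',
+		combine_with_prior => [ 'backup1' ]);
+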
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7a2807a9a3..1fa5f0ed26 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4014,3 +4014,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v10-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
From e74622823707444bff190b71dec4b29d5c3882be Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:15:14 -0500
Subject: [PATCH v10 2/7] Rename pg_verifybackup's JsonManifestParseContext
callback functions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The old names were too generic, and would have applied to any binary
that made use of JsonManifestParseContext. Rename to make the names
specific to pg_verifybackup, since there are plans afoot to reuse
this infrastructure.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/pg_verifybackup.c | 36 +++++++++++------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 8526eb9bbf..d921d0f003 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -119,15 +119,15 @@ static void parse_manifest_file(char *manifest_path,
manifest_files_hash **ht_p,
manifest_wal_range **first_wal_range_p);
-static void record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length,
- uint8 *checksum_payload);
-static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+static void verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
static void report_manifest_error(JsonManifestParseContext *context,
const char *fmt,...)
pg_attribute_printf(2, 3) pg_attribute_noreturn();
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.per_file_cb = record_manifest_details_for_file;
- context.per_wal_range_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = verifybackup_per_file_cb;
+ context.per_wal_range_cb = verifybackup_per_wal_range_cb;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
@@ -475,10 +475,10 @@ report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
* Record details extracted from the backup manifest for one file.
*/
static void
-record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload)
+verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
{
parser_context *pcxt = context->private_data;
manifest_files_hash *ht = pcxt->ht;
@@ -504,9 +504,9 @@ record_manifest_details_for_file(JsonManifestParseContext *context,
* Record details extracted from the backup manifest for one WAL range.
*/
static void
-record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
parser_context *pcxt = context->private_data;
manifest_wal_range *range;
--
2.37.1 (Apple Git-137.1)
Attachment: v10-0001-Rename-JsonManifestParseContext-callbacks.patch
From ef099e66862c7c6fe03410e4d9cd4bb450d7ee94 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:10:01 -0500
Subject: [PATCH v10 1/7] Rename JsonManifestParseContext callbacks.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
There is currently a worldwide oversupply of underscores, so use
some of them here as word separators. In the event of a later
underscore shortage, these can be removed again, and another of
PostgreSQL's innumerable methods of marking word boundaries can
be substituted.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/parse_manifest.c | 8 ++++----
src/bin/pg_verifybackup/parse_manifest.h | 18 +++++++++---------
src/bin/pg_verifybackup/pg_verifybackup.c | 4 ++--
src/tools/pgindent/typedefs.list | 4 ++--
4 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/bin/pg_verifybackup/parse_manifest.c
index bf0227c668..850adf90a8 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/bin/pg_verifybackup/parse_manifest.c
@@ -112,7 +112,7 @@ static bool parse_xlogrecptr(XLogRecPtr *result, char *input);
*
* Caller should set up the parsing context and then invoke this function.
* For each file whose information is extracted from the manifest,
- * context->perfile_cb is invoked. In case of trouble, context->error_cb is
+ * context->per_file_cb is invoked. In case of trouble, context->error_cb is
* invoked and is expected not to return.
*/
void
@@ -545,8 +545,8 @@ json_manifest_finalize_file(JsonManifestParseState *parse)
}
/* Invoke the callback with the details we've gathered. */
- context->perfile_cb(context, parse->pathname, size,
- checksum_type, checksum_length, checksum_payload);
+ context->per_file_cb(context, parse->pathname, size,
+ checksum_type, checksum_length, checksum_payload);
/* Free memory we no longer need. */
if (parse->size != NULL)
@@ -602,7 +602,7 @@ json_manifest_finalize_wal_range(JsonManifestParseState *parse)
"could not parse end LSN");
/* Invoke the callback with the details we've gathered. */
- context->perwalrange_cb(context, tli, start_lsn, end_lsn);
+ context->per_wal_range_cb(context, tli, start_lsn, end_lsn);
/* Free memory we no longer need. */
if (parse->timeline != NULL)
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/bin/pg_verifybackup/parse_manifest.h
index 7387a917a2..001b9a6a11 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/bin/pg_verifybackup/parse_manifest.h
@@ -21,13 +21,13 @@
struct JsonManifestParseContext;
typedef struct JsonManifestParseContext JsonManifestParseContext;
-typedef void (*json_manifest_perfile_callback) (JsonManifestParseContext *,
- char *pathname,
- size_t size, pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload);
-typedef void (*json_manifest_perwalrange_callback) (JsonManifestParseContext *,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+typedef void (*json_manifest_per_file_callback) (JsonManifestParseContext *,
+ char *pathname,
+ size_t size, pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload);
+typedef void (*json_manifest_per_wal_range_callback) (JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
const char *fmt,...) pg_attribute_printf(2, 3)
pg_attribute_noreturn();
@@ -35,8 +35,8 @@ typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
struct JsonManifestParseContext
{
void *private_data;
- json_manifest_perfile_callback perfile_cb;
- json_manifest_perwalrange_callback perwalrange_cb;
+ json_manifest_per_file_callback per_file_cb;
+ json_manifest_per_wal_range_callback per_wal_range_cb;
json_manifest_error_callback error_cb;
};
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..8526eb9bbf 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.perfile_cb = record_manifest_details_for_file;
- context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = record_manifest_details_for_file;
+ context.per_wal_range_cb = record_manifest_details_for_wal_range;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dba3498a13..3cea73e220 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3441,8 +3441,8 @@ jmp_buf
join_search_hook_type
json_aelem_action
json_manifest_error_callback
-json_manifest_perfile_callback
-json_manifest_perwalrange_callback
+json_manifest_per_file_callback
+json_manifest_per_wal_range_callback
json_ofield_action
json_scalar_action
json_struct_action
--
2.37.1 (Apple Git-137.1)
I made a pass over pg_combinebackup for NLS. I propose the attached
patch.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Right now the sectors on the hard disk run clockwise, but I heard a rumor that
you can squeeze 0.2% more throughput by running them counterclockwise.
It's worth the effort. Recommended." (Gerry Pourwelle)
Attachments:
0001-do-NLS-for-pg_combinebackup.patch
From a385542ff03514885fa4e84b0485e51cdcdd04bd Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 17 Nov 2023 10:51:36 +0100
Subject: [PATCH] do NLS for pg_combinebackup
---
src/bin/pg_combinebackup/backup_label.c | 34 +++++++++++----------
src/bin/pg_combinebackup/nls.mk | 11 +++++++
src/bin/pg_combinebackup/pg_combinebackup.c | 24 ++++++++++-----
src/bin/pg_combinebackup/reconstruct.c | 9 ++++--
4 files changed, 52 insertions(+), 26 deletions(-)
create mode 100644 src/bin/pg_combinebackup/nls.mk
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
index 2a62aa6fad..922e00854d 100644
--- a/src/bin/pg_combinebackup/backup_label.c
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -63,18 +63,18 @@ parse_backup_label(char *filename, StringInfo buf,
if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
{
if (!parse_lsn(s, e, start_lsn, &c))
- pg_fatal("%s: could not parse START WAL LOCATION",
- filename);
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
if (c >= e || *c != ' ')
- pg_fatal("%s: improper terminator for START WAL LOCATION",
- filename);
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
found |= 1;
}
else if (line_starts_with(s, e, "START TIMELINE: ", &s))
{
if (!parse_tli(s, e, start_tli))
- pg_fatal("%s: could not parse TLI for START TIMELINE",
- filename);
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
if (*start_tli == 0)
pg_fatal("%s: invalid TLI", filename);
found |= 2;
@@ -82,18 +82,18 @@ parse_backup_label(char *filename, StringInfo buf,
else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
{
if (!parse_lsn(s, e, previous_lsn, &c))
- pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
- filename);
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
if (c >= e || *c != '\n')
- pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
- filename);
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
found |= 4;
}
else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
{
if (!parse_tli(s, e, previous_tli))
- pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
- filename);
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
if (*previous_tli == 0)
pg_fatal("%s: invalid TLI", filename);
found |= 8;
@@ -103,13 +103,15 @@ parse_backup_label(char *filename, StringInfo buf,
}
if ((found & 1) == 0)
- pg_fatal("%s: could not find START WAL LOCATION", filename);
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
if ((found & 2) == 0)
- pg_fatal("%s: could not find START TIMELINE", filename);
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
if ((found & 4) != 0 && (found & 8) == 0)
- pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
if ((found & 8) != 0 && (found & 4) == 0)
- pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
}
/*
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 7bf56e57ae..618b5dd7f6 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -521,29 +521,33 @@ check_control_files(int n_backups, char **backup_dirs)
{
ControlFileData *control_file;
bool crc_ok;
+ char *controlpath;
- pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+
+ pg_log_debug("reading \"%s\"", controlpath);
control_file = get_controlfile(backup_dirs[i], &crc_ok);
/* Control file contents not meaningful if CRC is bad. */
if (!crc_ok)
- pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+ pg_fatal("%s: crc is incorrect", controlpath);
/* Can't interpret control file if not current version. */
if (control_file->pg_control_version != PG_CONTROL_VERSION)
- pg_fatal("%s/global/pg_control: unexpected control file version",
- backup_dirs[i]);
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
/* System identifiers should all match. */
if (i == n_backups - 1)
system_identifier = control_file->system_identifier;
else if (system_identifier != control_file->system_identifier)
- pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
- backup_dirs[i], (unsigned long long) system_identifier,
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
(unsigned long long) control_file->system_identifier);
/* Release memory. */
pfree(control_file);
+ pfree(controlpath);
}
/*
@@ -932,12 +936,16 @@ process_directory_recursively(Oid tsoid,
manifest_path);
if (mfile == NULL)
{
+ char *fullpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+
/*
* The directory is out of sync with the backup_manifest,
* so emit a warning.
*/
- pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
- input_directory, manifest_path);
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ fullpath, manifest_path);
+ pfree(fullpath);
}
else if (mfile->checksum_type == checksum_type)
{
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index e7f0523fe9..19ab96904b 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -283,13 +283,18 @@ reconstruct_from_incremental_file(char *input_filename,
manifest_path);
if (mfile == NULL)
{
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
/*
* The directory is out of sync with the backup_manifest, so emit
* a warning.
*/
- pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
- prior_backup_dirs[copy_source_index],
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
manifest_path);
+ pfree(path);
}
else if (mfile->checksum_type == checksum_type)
{
--
2.39.2
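To make the pattern in the patch above concrete: hoisting the fixed keywords out of the format strings means several call sites share a single translatable message, and translators never see (and so cannot mangle) the keywords themselves. A minimal before-and-after sketch of the idiom (illustrative lines, not taken verbatim from the patch):

    /* Before: each keyword is baked into its own translatable string. */
    pg_fatal("%s: could not parse START WAL LOCATION", filename);

    /*
     * After: one translatable string; the keyword travels as a %s
     * argument, so the message catalog contains a single entry.
     */
    pg_fatal("%s: could not parse %s", filename, "START WAL LOCATION");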
On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I made a pass over pg_combinebackup for NLS. I propose the attached
> patch.
This doesn't quite compile for me, so I changed a few things and
incorporated it. Hopefully I didn't mess anything up.
Here's v11. In addition to incorporating Álvaro's NLS changes, with
the off-list help of Jakub Wartak, I finally tracked down two one-line
bugs in BlockRefTableEntryGetBlocks that have been causing the cfbot
to blow up on these patches. What I hadn't realized is that cfbot runs
with the relation segment size changed to 6 blocks, which tickled some
code paths that I wasn't exercising locally. Thanks a ton to Jakub for
the help running this down. cfbot was unhappy about a %lu so I've
changed that to %zu in this version, too. Finally, the previous
version of this patch set had some pgindent damage, so that is
hopefully now cleaned up as well.
I wish I had better ideas about how to thoroughly test this. I've got
a bunch of different tests for pg_combinebackup and I think those are
good, but the bugs mentioned in the previous paragraph show that those
aren't sufficient to catch all of the logic errors that can exist,
which is not great. But, as I say, I'm not quite sure how to do
better, so I guess I'll just need to keep fixing problems as we find
them.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v11-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
From 12544993d124cedf799108a2ffcfb66f9c84b7b5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v11 3/7] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
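To illustrate the reuse this move enables, here is a sketch of a hypothetical frontend consumer of the relocated parser. The callback and state names are invented for the example; the context fields and the json_parse_manifest() call follow the patched header.

    #include "postgres_fe.h"

    #include "common/parse_manifest.h"

    static void
    my_per_file_cb(JsonManifestParseContext *context,
                   char *pathname, size_t size,
                   pg_checksum_type checksum_type,
                   int checksum_length, uint8 *checksum_payload)
    {
        /* record whatever this tool needs to know about one file */
    }

    static void
    my_per_wal_range_cb(JsonManifestParseContext *context,
                        TimeLineID tli,
                        XLogRecPtr start_lsn, XLogRecPtr end_lsn)
    {
        /* likewise for one WAL range */
    }

    static void
    my_error_cb(JsonManifestParseContext *context, const char *fmt,...)
    {
        /* report the problem; this callback must not return */
        exit(1);
    }

    static void
    parse_some_manifest(void *state, char *buffer, size_t size)
    {
        JsonManifestParseContext context;

        context.private_data = state;
        context.per_file_cb = my_per_file_cb;
        context.per_wal_range_cb = my_per_wal_range_cb;
        context.error_cb = my_error_cb;
        json_parse_manifest(&context, buffer, size);
    }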
v11-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
From 276a7bc53517e01b88c995aa0a75422d6fa9ffea Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:15:14 -0500
Subject: [PATCH v11 2/7] Rename pg_verifybackup's JsonManifestParseContext
callback functions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The old names were too generic, and would have applied to any binary
that made use of JsonManifestParseContext. Rename to make the names
specific to pg_verifybackup, since there are plans afoot to reuse
this infrastructure.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/pg_verifybackup.c | 36 +++++++++++------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 8526eb9bbf..d921d0f003 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -119,15 +119,15 @@ static void parse_manifest_file(char *manifest_path,
manifest_files_hash **ht_p,
manifest_wal_range **first_wal_range_p);
-static void record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length,
- uint8 *checksum_payload);
-static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+static void verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
static void report_manifest_error(JsonManifestParseContext *context,
const char *fmt,...)
pg_attribute_printf(2, 3) pg_attribute_noreturn();
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.per_file_cb = record_manifest_details_for_file;
- context.per_wal_range_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = verifybackup_per_file_cb;
+ context.per_wal_range_cb = verifybackup_per_wal_range_cb;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
@@ -475,10 +475,10 @@ report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
* Record details extracted from the backup manifest for one file.
*/
static void
-record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload)
+verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
{
parser_context *pcxt = context->private_data;
manifest_files_hash *ht = pcxt->ht;
@@ -504,9 +504,9 @@ record_manifest_details_for_file(JsonManifestParseContext *context,
* Record details extracted from the backup manifest for one WAL range.
*/
static void
-record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
parser_context *pcxt = context->private_data;
manifest_wal_range *range;
--
2.37.1 (Apple Git-137.1)
v11-0001-Rename-JsonManifestParseContext-callbacks.patch
From 8894a309dcc7dd43dccda0f4de5051c032b458b8 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:10:01 -0500
Subject: [PATCH v11 1/7] Rename JsonManifestParseContext callbacks.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
There is currently a worldwide oversupply of underscores, so use
some of them here as word separators. In the event of a later
underscore shortage, these can be removed again, and another of
PostgreSQL's innumerable methods of marking word boundaries can
be substituted.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/parse_manifest.c | 8 ++++----
src/bin/pg_verifybackup/parse_manifest.h | 18 +++++++++---------
src/bin/pg_verifybackup/pg_verifybackup.c | 4 ++--
src/tools/pgindent/typedefs.list | 4 ++--
4 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/bin/pg_verifybackup/parse_manifest.c
index bf0227c668..850adf90a8 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/bin/pg_verifybackup/parse_manifest.c
@@ -112,7 +112,7 @@ static bool parse_xlogrecptr(XLogRecPtr *result, char *input);
*
* Caller should set up the parsing context and then invoke this function.
* For each file whose information is extracted from the manifest,
- * context->perfile_cb is invoked. In case of trouble, context->error_cb is
+ * context->per_file_cb is invoked. In case of trouble, context->error_cb is
* invoked and is expected not to return.
*/
void
@@ -545,8 +545,8 @@ json_manifest_finalize_file(JsonManifestParseState *parse)
}
/* Invoke the callback with the details we've gathered. */
- context->perfile_cb(context, parse->pathname, size,
- checksum_type, checksum_length, checksum_payload);
+ context->per_file_cb(context, parse->pathname, size,
+ checksum_type, checksum_length, checksum_payload);
/* Free memory we no longer need. */
if (parse->size != NULL)
@@ -602,7 +602,7 @@ json_manifest_finalize_wal_range(JsonManifestParseState *parse)
"could not parse end LSN");
/* Invoke the callback with the details we've gathered. */
- context->perwalrange_cb(context, tli, start_lsn, end_lsn);
+ context->per_wal_range_cb(context, tli, start_lsn, end_lsn);
/* Free memory we no longer need. */
if (parse->timeline != NULL)
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/bin/pg_verifybackup/parse_manifest.h
index 7387a917a2..001b9a6a11 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/bin/pg_verifybackup/parse_manifest.h
@@ -21,13 +21,13 @@
struct JsonManifestParseContext;
typedef struct JsonManifestParseContext JsonManifestParseContext;
-typedef void (*json_manifest_perfile_callback) (JsonManifestParseContext *,
- char *pathname,
- size_t size, pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload);
-typedef void (*json_manifest_perwalrange_callback) (JsonManifestParseContext *,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+typedef void (*json_manifest_per_file_callback) (JsonManifestParseContext *,
+ char *pathname,
+ size_t size, pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload);
+typedef void (*json_manifest_per_wal_range_callback) (JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
const char *fmt,...) pg_attribute_printf(2, 3)
pg_attribute_noreturn();
@@ -35,8 +35,8 @@ typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
struct JsonManifestParseContext
{
void *private_data;
- json_manifest_perfile_callback perfile_cb;
- json_manifest_perwalrange_callback perwalrange_cb;
+ json_manifest_per_file_callback per_file_cb;
+ json_manifest_per_wal_range_callback per_wal_range_cb;
json_manifest_error_callback error_cb;
};
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..8526eb9bbf 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.perfile_cb = record_manifest_details_for_file;
- context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = record_manifest_details_for_file;
+ context.per_wal_range_cb = record_manifest_details_for_wal_range;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dba3498a13..3cea73e220 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3441,8 +3441,8 @@ jmp_buf
join_search_hook_type
json_aelem_action
json_manifest_error_callback
-json_manifest_perfile_callback
-json_manifest_perwalrange_callback
+json_manifest_per_file_callback
+json_manifest_per_wal_range_callback
json_ofield_action
json_scalar_action
json_struct_action
--
2.37.1 (Apple Git-137.1)
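The decision at the center of the patch below is made file by file in sendDir(). A condensed sketch of that hunk (declarations, the surrounding loop, path lookup, and the INCREMENTAL.<name> renaming are elided):

    unsigned    num_blocks_required = 0;
    unsigned    truncation_block_length = 0;
    FileBackupMethod method;

    /* Ask the incremental machinery how this relation file should be sent. */
    method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
                                 relfilenumber, relForkNum, segno,
                                 statbuf.st_size,
                                 &num_blocks_required,
                                 relative_block_numbers,
                                 &truncation_block_length);
    if (method == BACK_UP_FILE_INCREMENTALLY)
    {
        /* Send only the changed blocks, plus truncation metadata. */
        statbuf.st_size = GetIncrementalFileSize(num_blocks_required);
    }
    /* else BACK_UP_FILE_FULLY: the file is sent just as in a full backup. */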
v11-0005-Add-support-for-incremental-backup.patch
From eb179e89810240d208fed31811d0285c8ee76eb4 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v11 5/7] Add support for incremental backup.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Should we send the whole backup manifest to the server or, say,
just an LSN?
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 313 +++-
src/backend/backup/basebackup_incremental.c | 913 ++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1275 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
48 files changed, 5691 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and just take full backups, which are simpler
+ to manage. For a large database that is heavily modified throughout,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6073b93480..d5083afc87 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4137,13 +4137,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove the one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..9ecce5f222 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1129,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1163,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1183,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1192,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1234,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1386,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1461,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1536,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
+
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
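For reviewers, a note on the incremental file layout that sendFile() emits
above: it is three 4-byte integers - magic, block count, truncation block
length - followed by the sorted array of 4-byte block numbers and then the
block contents, which is exactly what GetIncrementalFileSize() accounts
for. Here's a minimal standalone sketch of a reader for that layout; it
assumes the default BLCKSZ of 8192 and the server's byte order, and treats
the magic value as opaque since INCREMENTAL_MAGIC's value isn't shown in
this excerpt:

/* incremental_header_demo.c - illustration only, not part of the patch */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_BLCKSZ 8192            /* assumes default block size */

int
main(int argc, char **argv)
{
    FILE       *f;
    uint32_t    hdr[3];             /* magic, block count, truncation length */
    uint32_t   *blocknos;
    uint32_t    i;

    if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;
    if (fread(hdr, sizeof(uint32_t), 3, f) != 3)
        return 1;
    printf("magic 0x%08x, %u blocks, truncation block length %u\n",
           hdr[0], hdr[1], hdr[2]);
    blocknos = malloc(sizeof(uint32_t) * hdr[1]);
    if (blocknos == NULL ||
        fread(blocknos, sizeof(uint32_t), hdr[1], f) != hdr[1])
        return 1;
    /* block i's contents start right after the header and block list */
    for (i = 0; i < hdr[1]; i++)
        printf("block %u at offset %zu\n", blocknos[i],
               sizeof(uint32_t) * (3 + (size_t) hdr[1]) +
               (size_t) i * DEMO_BLCKSZ);
    free(blocknos);
    fclose(f);
    return 0;
}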
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..303117e19e
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,913 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose them to relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in COPY mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
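Putting the walsender pieces together, the wire flow a client follows is:
issue UPLOAD_MANIFEST on a replication connection, stream the manifest as
CopyData messages, end the COPY, then issue BASE_BACKUP with the
INCREMENTAL option. A minimal libpq sketch, with most error handling
elided; the manifest literal is a stand-in for real backup_manifest
contents, and the pg_basebackup changes below do all of this properly:

/* upload_manifest_demo.c - illustration only */
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("replication=true");
    const char *manifest = "...";   /* contents of a backup_manifest */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
        return 1;
    if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
        return 1;
    res = PQgetResult(conn);
    if (PQresultStatus(res) != PGRES_COPY_IN)
        return 1;
    PQclear(res);
    PQputCopyData(conn, manifest, (int) strlen(manifest));
    PQputCopyEnd(conn, NULL);
    res = PQgetResult(conn);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        return 1;
    PQclear(res);
    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);           /* drain remaining results */
    /* the backup then streams back as COPY data; handling elided */
    res = PQexec(conn, "BASE_BACKUP (INCREMENTAL)");
    PQclear(res);
    PQfinish(conn);
    return 0;
}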
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..26fd9ad0bc 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
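+ *
+ * For illustration only (the values here are made up), a backup_label from
+ * an incremental backup might contain lines like:
+ *
+ * START WAL LOCATION: 0/6000028 (file 000000010000000000000006)
+ * START TIMELINE: 1
+ * INCREMENTAL FROM LSN: 0/4000028
+ * INCREMENTAL FROM TLI: 1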
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
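+ *
+ * For example, given the text "0/2000028 ", this stores 0x2000028 into *lsn
+ * and leaves *c pointing at the trailing space.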
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..f2b45787e9
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
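+ /* Read and write in 50-block chunks; 400kB with the default 8kB BLCKSZ. */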
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d06c3ffe0f
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
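+ *
+ * For example, a 10MB manifest gives an estimate of roughly 100,000
+ * entries, which is plenty close enough for a hash table sizing hint.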
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..6eb705c959
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1275 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
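+/* e.g. an incremental file "INCREMENTAL.16384" is written out as "16384" */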
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
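+ *
+ * For example, -T /srv/tblspc=/mnt/tblspc maps /srv/tblspc to /mnt/tblspc,
+ * while a literal equals sign can be escaped, as in -T /odd\=name=/mnt/odd.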
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
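+ *
+ * To illustrate the expected chain: given a full backup and two
+ * incrementals on the command line, the second incremental's INCREMENTAL
+ * FROM LSN/TLI must match the first incremental's start, and the first
+ * incremental's INCREMENTAL FROM LSN/TLI must match the full backup's
+ * start.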
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ uint64 oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strncpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ /*
+ * Copy the payload rather than aliasing the manifest's copy,
+ * since it will be freed once the manifest entry is written.
+ */
+ checksum_length = mfile->checksum_length;
+ checksum_payload = pg_malloc(checksum_length);
+ memcpy(checksum_payload, mfile->checksum_payload,
+ checksum_length);
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", ifulldir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g. 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found; manifests is a
+ * matching array of their parsed backup manifests, used to reuse existing
+ * checksums where possible.
+ *
+ * On return, *checksum_length and *checksum_payload hold the checksum of
+ * the reconstructed file, if a checksum of type checksum_type was obtained.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
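+ *
+ * For example (illustrative numbers only): if the newest incremental file
+ * stores block 7 as its third recorded block (array index 2), the loop
+ * below sets sourcemap[7] to latest_source and offsetmap[7] to
+ * header_length + 2 * BLCKSZ.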
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without ever
+ * generating any WAL for those blocks.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
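+ *
+ * The layout implied by the reads below: a magic number
+ * (INCREMENTAL_MAGIC), a block count, a truncation block length, and an
+ * array of num_blocks relative block numbers, followed by the actual
+ * block data starting at header_length.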
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
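+ /*
+ * Each range below is rendered as "block:source@offset" (or with a
+ * "start-end" block range), with "zero" for blocks to be zero-filled;
+ * e.g., with hypothetical paths, a plan might read:
+ * " 0-2:x/base/1/16384@0 3:zero 4:y/base/1/INCREMENTAL.16384@16".
+ */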
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source, except if dry-run. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+ 'node3 has the expected rows');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
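+ *
+ * A generated entry looks roughly like this (values illustrative only):
+ *
+ * { "Path": "base/1/16384", "Size": 8192,
+ * "Last-Modified": "2023-06-14 18:30:00 GMT",
+ * "Checksum-Algorithm": "CRC32C", "Checksum": "d5be4a34" }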
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
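+ /*
+ * Each range is emitted as, e.g. (illustrative values):
+ * { "Timeline": 1, "Start-LSN": "0/2000028", "End-LSN": "0/2000100" }
+ */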
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
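To make the intended use of this API concrete, a caller inside
pg_combinebackup is expected to do roughly the following. This is only a
sketch: output_directory, size, mtime, and first_wal_range stand in for
values the real caller computes, and the file path is arbitrary.

    manifest_writer *mwriter = create_manifest_writer(output_directory);

    /* once per file that ends up in the synthetic full backup */
    add_file_to_manifest(mwriter, "base/1/1259", size, mtime,
                         CHECKSUM_TYPE_NONE, 0, NULL);

    /* writes out the WAL ranges and the manifest checksum, then closes */
    finalize_manifest(mwriter, first_wal_range);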
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
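Roughly speaking, basebackup.c is expected to consult this API once per
relation file, along these lines (a sketch only; the variables here are
illustrative, and relative_block_numbers must point to an array with
enough space for the file's blocks):

    FileBackupMethod method;
    unsigned    num_blocks;
    unsigned    truncation_block_length;

    method = GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
                                 forknum, segno, size,
                                 &num_blocks, relative_block_numbers,
                                 &truncation_block_length);
    if (method == BACK_UP_FILE_INCREMENTALLY)
        incremental_size = GetIncrementalFileSize(num_blocks);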
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..72b4ecaf12 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
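+
+For example, to initialize a node from an incremental backup "backup2"
+taken on top of a full backup "backup1" (hypothetical backup names):
+
+  $node->init_from_backup($root_node, 'backup2',
+      combine_with_prior => [ 'backup1' ]);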
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7a2807a9a3..1fa5f0ed26 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4014,3 +4014,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v11-0004-Add-a-new-WAL-summarizer-process.patch (application/octet-stream)
From a22f5ed951d71d8b2520a8a3597b966ec158f9e9 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v11 4/7] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1383 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3726 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc35a46e5e..6073b93480 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4134,6 +4134,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
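As a concrete illustration of the two settings documented above, enabling
summarization with a 30-day retention window would look something like
this in postgresql.conf (assuming wal_summary_keep_time accepts the usual
time-unit suffixes, given that it is stored in minutes):

    summarize_wal = on
    wal_summary_keep_time = '30d'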
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1159dff1a6..678495a64b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3574,6 +3575,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3853,8 +3891,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter two do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3897,6 +3935,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5221,9 +5279,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6940,6 +6998,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7614,6 +7691,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
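To show how these pieces are meant to fit together, the consuming code
(ultimately the incremental backup path) should end up doing something
like this; tli, start_lsn, and end_lsn stand in for the backup's actual
WAL range, and real code would produce a more detailed error:

    List       *wslist;
    XLogRecPtr  missing_lsn;

    wslist = GetWalSummaries(tli, start_lsn, end_lsn);
    if (!WalSummariesAreComplete(wslist, start_lsn, end_lsn, &missing_lsn))
        ereport(ERROR,
                errmsg("WAL summaries are incomplete; first missing LSN is %X/%X",
                       LSN_FORMAT_ARGS(missing_lsn)));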
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
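These make it easy to inspect what the summarizer has produced directly
from SQL, for example (the LSN arguments below are made up; in practice
you would take them from the first query's output):

    SELECT * FROM pg_available_wal_summaries();
    SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/2000100');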
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7b6b613c4a..7952fd5c4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -931,6 +935,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1833,6 +1840,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2667,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3022,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3141,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3550,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3706,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3734,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3832,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4054,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5403,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5543,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..a083647c42
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1383 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and LSN
+ * at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of a WAL segment even though
+ * that position might fall in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60;	/* ten days, in minutes */
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else gets there first, we can simply return the
+ * requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the requested information to the caller. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
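+
+/*
+ * Hypothetical usage sketch for the function above; 'target_lsn' and the
+ * ten-second timeout are illustrative values, not part of this patch:
+ *
+ *     XLogRecPtr reached;
+ *
+ *     reached = WaitForWalSummarization(target_lsn, 10000);
+ *     if (reached < target_lsn)
+ *         ereport(ERROR,
+ *                 errmsg("timed out waiting for WAL summarization"));
+ */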
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
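+
+/*
+ * Worked example of the summary file name generated above: a summary on
+ * TLI 1 covering WAL from 0/1000028 to 0/2000000 would be written as
+ * pg_wal/summaries/0000000100000000010000280000000002000000.summary,
+ * i.e. the TLI followed by the start and end LSNs, each half as %08X.
+ */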
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private_data->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read, but never below 1 quantum,
+ * which is a fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
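+
+/*
+ * Illustrative trace of the adaptation above (quantum counts only; actual
+ * durations depend on MS_PER_SLEEP_QUANTUM, and the starting value of 1
+ * is assumed): on an idle system, sleep_quanta doubles from 1 to 2, 4,
+ * 8, ... until capped at MAX_SLEEP_QUANTA; if a burst then reads, say,
+ * 10 pages before the next sleep, sleep_quanta drops by 10, bottoming
+ * out at the minimum of 1.
+ */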
+
+/*
+ * Remove summary files for WAL that no longer exists on disk, once they
+ * are older than wal_summary_keep_time. This runs at most once per
+ * checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the
+ * summary file, provided that its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cfc5afaa6f..ef2a3a2bfd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b764ef6998..a6de5aca0a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3191,6 +3204,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e48c066a5b..e732453daa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
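+
+/*
+ * Worked example of the representation above, with an illustrative block
+ * number: block 150000 falls in chunk 150000 / BLOCKS_PER_CHUNK = 2, at
+ * offset 150000 % BLOCKS_PER_CHUNK = 18928. In the array representation,
+ * that offset is stored as a single uint16; in the bitmap representation,
+ * it is bit 18928 % 16 = 0 of array element 18928 / 16 = 1183.
+ */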
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
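+
+/*
+ * Hypothetical usage sketch for the function above; 'nblocks_in_fork' is
+ * an assumed caller-supplied relation length, sized so that one call can
+ * return every modified block below it:
+ *
+ *     BlockNumber *blocks = palloc(nblocks_in_fork * sizeof(BlockNumber));
+ *     int         n;
+ *
+ *     n = BlockRefTableEntryGetBlocks(entry, 0, nblocks_in_fork,
+ *                                     blocks, nblocks_in_fork);
+ */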
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
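+
+/*
+ * Illustrative end-to-end sketch of the in-memory API; 'rlocator' and the
+ * WalSummaryIO setup are assumed, mirroring the caller in walsummarizer.c:
+ *
+ *     BlockRefTable *brtab = CreateEmptyBlockRefTable();
+ *
+ *     BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 42);
+ *     BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 10);
+ *     WriteBlockRefTable(brtab, WriteWalSummary, &io);
+ *
+ * Note that the truncation to 10 blocks forgets the mark on block 42,
+ * since setting the limit block discards equal-or-higher block numbers.
+ */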
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
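+
+/*
+ * Hypothetical read-loop sketch showing the required calling pattern; the
+ * I/O and error callbacks are assumed to be supplied by the caller:
+ *
+ *     reader = CreateBlockRefTableReader(read_callback, read_callback_arg,
+ *                                        filename, error_callback,
+ *                                        error_callback_arg);
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *                                            &limit_block))
+ *     {
+ *         while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
+ *                                                  lengthof(blocks))) > 0)
+ *             ... process n modified block numbers for this fork ...
+ *     }
+ *     DestroyBlockRefTableReader(reader);
+ */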
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..79c8f86d89 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12100,4 +12100,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..4a6792e5f9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3cea73e220..7a2807a9a3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4003,3 +4003,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.37.1 (Apple Git-137.1)
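To make the new writer API easier to try out, here is a minimal sketch
(not part of the patch) of the incremental write path declared in
blkreftable.h. The OIDs are arbitrary example values, and write_cb /
write_cb_arg stand in for whatever io_callback_fn the caller supplies:

    #include "common/blkreftable.h"

    static void
    write_example_summary(io_callback_fn write_cb, void *write_cb_arg)
    {
        RelFileLocator rlocator = {1663, 5, 16384}; /* example OIDs */
        BlockRefTableWriter *writer;
        BlockRefTableEntry *entry;

        writer = CreateBlockRefTableWriter(write_cb, write_cb_arg);

        entry = CreateBlockRefTableEntry(rlocator, MAIN_FORKNUM);
        BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 0);
        BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 42);
        BlockRefTableEntrySetLimitBlock(entry, 100); /* truncated to 100 blocks */
        BlockRefTableWriteEntry(writer, entry);
        BlockRefTableFreeEntry(entry);

        /*
         * Further entries must follow in key order: tablespace, then
         * database, then relfilenumber, then fork number.
         */

        DestroyBlockRefTableWriter(writer); /* writes sentinel and CRC */
    }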
Attachment: v11-0006-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From a27ebfaba3018f28ab1fc8e42495ad406f93055f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v11 6/7] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 477 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ truncated within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index a083647c42..7966755f22 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -338,7 +338,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "f:iqw:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1fa5f0ed26..0565e160f2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4026,3 +4026,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.37.1 (Apple Git-137.1)
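For reference, the output format implemented above looks like this (the
OIDs and block numbers are invented for illustration, following the
printf formats in dump_one_relation):

    TS 1663, DB 5, REL 16384, FORK main: limit 100
    TS 1663, DB 5, REL 16384, FORK main: blocks 0..17
    TS 1663, DB 5, REL 16384, FORK main: block 42

With --individual each modified block gets its own line instead of being
folded into ranges, and with --quiet the file is parsed but nothing is
printed unless an error occurs.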
Attachment: v11-0007-Test-patch-Enable-summarize_wal-by-default.patch (application/octet-stream)
From 1d175bbeef3ab83e5c992a648e22c7c1cd3c671f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v11 7/7] Test patch: Enable summarize_wal by default.
To avoid test failures, must remove the prohibition against running
summarize_wal=off with wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7952fd5c4b..a804d07ce5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -935,9 +935,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7966755f22..74a0116a13 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a6de5aca0a..170f491d7a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.37.1 (Apple Git-137.1)
On 2023-Nov-16, Robert Haas wrote:
On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I don't understand this point. Currently, the protocol is that
UPLOAD_MANIFEST is used to send the manifest prior to requesting the
backup. You seem to be saying that you're thinking of removing support
for UPLOAD_MANIFEST and instead just give the LSN as an option to the
BASE_BACKUP command?I don't think I'd want to do exactly that, because then you could only
send one LSN, and I do think we want to send a set of LSN ranges with
the corresponding TLI for each. I was thinking about dumping
UPLOAD_MANIFEST and instead having a command like:INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698
The client would execute this command one or more times before
starting an incremental backup.
That sounds good to me. Not having to parse the manifest server-side
sounds like a win, as does saving the transfer, for the cases where the
manifest is large.
Is this meant to support multiple timelines each with non-overlapping
adjacent ranges, rather than multiple non-adjacent ranges?
Do I have it right that you want to rewrite this bit before considering
this ready to commit?
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles,
pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)
On Mon, Nov 20, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
That sounds good to me. Not having to parse the manifest server-side
sounds like a win, as does saving the transfer, for the cases where the
manifest is large.
OK. I'll look into this next week, hopefully.
Is this meant to support multiple timelines each with non-overlapping
adjacent ranges, rather than multiple non-adjacent ranges?
Correct. I don't see how non-adjacent LSN ranges could ever be a
useful thing, but adjacent ranges on different timelines are useful.
Do I have it right that you want to rewrite this bit before considering
this ready to commit?
For sure. I don't think this is the only thing that needs to be
revised before commit, but it's definitely *a* thing that needs to be
revised before commit.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Nov 20, 2023 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
Is this meant to support multiple timelines each with non-overlapping
adjacent ranges, rather than multiple non-adjacent ranges?

Correct. I don't see how non-adjacent LSN ranges could ever be a
useful thing, but adjacent ranges on different timelines are useful.
Thinking about this a bit more, there are a couple of things we could
do here in terms of syntax. One idea is to give up on having a
separate MANIFEST-WAL-RANGE command for each range and instead just
cram everything into either a single command:
MANIFEST-WAL-RANGES {tli} {startlsn} {endlsn}...
Or even into a single option to the BASE_BACKUP command:
BASE_BACKUP yadda yadda INCREMENTAL 'tli@startlsn-endlsn,...'
Or, since we expect adjacent, non-overlapping ranges, you could even
arrange to elide the duplicated boundary LSNs, e.g.
MANIFEST-WAL-RANGES {{tli} {lsn}}... {final-lsn}
Or
BASE_BACKUP yadda yadda INCREMENTAL 'tli@lsn,...,final-lsn'
I'm not sure what's best here. Trying to trim out the duplicated
boundary LSNs feels a bit like rearrangement for the sake of
rearrangement, but maybe it isn't really.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Nov 20, 2023 at 4:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I made a pass over pg_combinebackup for NLS. I propose the attached
patch.

This doesn't quite compile for me so I changed a few things and
incorporated it. Hopefully I didn't mess anything up.

Here's v11.
[..]
I wish I had better ideas about how to thoroughly test this. [..]
Hopefully the below adds some confidence; I've done some further
quick(?) checks today and the results are good:
make check-world #GOOD
test_full_pri__incr_stby__restore_on_pri.sh #GOOD
test_full_pri__incr_stby__restore_on_stby.sh #GOOD*
test_full_stby__incr_stby__restore_on_pri.sh #GOOD
test_full_stby__incr_stby__restore_on_stby.sh #GOOD*
test_many_incrementals_dbcreate.sh #GOOD
test_many_incrementals.sh #GOOD
test_multixact.sh #GOOD
test_pending_2pc.sh #GOOD
test_reindex_and_vacuum_full.sh #GOOD
test_truncaterollback.sh #GOOD
test_unlogged_table.sh #GOOD
test_across_wallevelminimal.sh # GOOD (expected failure: WAL summaries
are off under wal_level=minimal, so an incremental backup cannot be
taken --> a full backup needs to be taken after wal_level=minimal)
CFbot failed on two hosts this time, I haven't looked at the details
yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? ->
LOG: WAL summarizer process (PID 71511) was terminated by signal 6:
Aborted?)
The remaining test idea is to have a longer running DB under stress
test in more real-world conditions and try to recover using chained
incremental backups (one such test was carried out on patchset v6 and
the result was good back then).
-J.
On Wed, Nov 22, 2023 at 3:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
CFbot failed on two hosts this time, I haven't looked at the details
yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? ->
LOG: WAL summarizer process (PID 71511) was terminated by signal 6:
Aborted?)
Robert pinged me to see if I had any ideas.
The reason it fails on Windows is because there is a special code path
for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it
is incomplete: it returns early without feeding the data into the
checksum, so all the checksums have the same initial and bogus value.
I commented that part out so it took the normal path like Unix, and it
passed.
The reason it fails on Linux 32 bit with -fsanitize is because this
has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer
when doing a size test that could easily be changed to non-overflowing
formulation. AFAICS it is not a live bug because it comes to the
right conclusion without deferencing the pointer due to other checks,
but the sanitizer is not wrong to complain about it and I will post a
patch to fix that in a new thread. With the draft patch I am testing,
the sanitizer is happy and this passes too.
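To make that concrete, here is a generic illustration of the two
formulations (made-up names, not the actual xlogreader.c code):

    /* overflow-prone: on a 32-bit build, ptr + record_len can wrap */
    if (ptr + record_len > end_of_buffer)
        return false;

    /* non-overflowing: compare sizes instead of addresses */
    if (record_len > (size_t) (end_of_buffer - ptr))
        return false;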
On Thu, Nov 23, 2023 at 11:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Robert pinged me to see if I had any ideas.
Thanks, Thomas.
The reason it fails on Windows is because there is a special code path
for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it
is incomplete: it returns early without feeding the data into the
checksum, so all the checksums have the same initial and bogus value.
I commented that part out so it took the normal path like Unix, and it
passed.
Yikes, that's embarrassing. Thanks for running it down. There is logic
in the caller to figure out whether we need to recompute the checksum
or can reuse one we already have, but copy_file() doesn't understand
that it should take the slow path if a new checksum computation is
required.
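Something like this is probably the right shape for the fix, taking the
fast path only when no new checksum is needed (the function names here
are illustrative, not necessarily the identifiers used in copy_file.c):

    if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
    {
    #ifdef WIN32
        /* fast path: let the OS copy the file; no checksum to feed */
        copy_file_copyfile(src, dst);
        return;
    #endif
    }

    /* slow path: copy block by block, updating the checksum as we go */
    copy_file_blocks(src, dst, checksum_ctx);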
The reason it fails on Linux 32 bit with -fsanitize is because this
has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer
when doing a size test that could easily be changed to non-overflowing
formulation. AFAICS it is not a live bug because it comes to the
right conclusion without dereferencing the pointer due to other checks,
but the sanitizer is not wrong to complain about it and I will post a
patch to fix that in a new thread. With the draft patch I am testing,
the sanitizer is happy and this passes too.
Thanks so much.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Nov 15, 2023 at 9:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
So I've spent some more time playing with patchset v8 (without the
6/6 testing patch related to wal_level=minimal); anything tested
against patchset v9 instead is marked as such.
Thanks, as usual, for that.
2. Usability thing: I hit the timeout hard: "This backup requires WAL
to be summarized up to 0/90000D8, but summarizer has only reached
0/0." with summarize_wal=off (default), but apparently this is in the TODO.
Looks like an important usability issue.
All right. I'd sort of forgotten about the need to address that issue,
but apparently, I need to re-remember.
5. On v8 I've finally played a little bit with standby(s) and this
patchset, using a couple of basic scenarios while mixing the source of
the backups:

a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
# sometimes I'm getting spurious errors like these when doing
incrementals on standby with -c fast:
2023-11-15 13:49:05.721 CET [10573] LOG: recovery restart point
at 0/A000028
2023-11-15 13:49:07.591 CET [10597] WARNING: aborting backup due
to backend exiting before pg_backup_stop was called
2023-11-15 13:49:07.591 CET [10597] ERROR: manifest requires WAL
from final timeline 1 ending at 0/A0000F8, but this backup starts at
0/A000028
2023-11-15 13:49:07.591 CET [10597] STATEMENT: BASE_BACKUP (
INCREMENTAL, LABEL 'pg_basebackup base backup', PROGRESS,
CHECKPOINT 'fast', WAIT 0, MANIFEST 'yes', TARGET 'client')
# when you retry the same pg_basebackup it goes fine (looks like a
CHECKPOINT-on-standby/restartpoint <-> summarizer disconnect; I'll dig
deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
against the primary just before pg_basebackup --incr on the standby
works around it)

b. full on primary, incr1 on standby, full db restore (incl. incr1) on
standby # WORKS
c. full on standby, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*
d. full on primary, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS**

** - needs pg_promote() due to the controlfile having the standby bit
set, plus potential fiddling with postgresql.auto.conf, as it contains
the primary_conninfo GUC.
Well, "manifest requires WAL from final timeline 1 ending at
0/A0000F8, but this backup starts at 0/A000028" is a valid complaint,
not a spurious error. It's essentially saying that WAL replay for this
incremental backup would have to begin at a location that is earlier
than where replay for the earlier backup would have to end while
recovering that backup. It's almost like you're trying to go backwards
in time, with the incremental happening before the full backup instead
of after it. I think the reason this is happening is that when you
take a backup, recovery has to start from the previous checkpoint. On
the primary, we perform a new checkpoint and plan to start recovery
from it. But on a standby, we can't perform a new checkpoint, since we
can't write WAL, so we arrange for recovery of the backup to begin
from the most recent checkpoint. And if you do two backups on the
standby in a row without much happening in the middle, then the most
recent checkpoint will be the same for both. And that I think is
what's resulting in this error, because the end of the backup follows
the start of the backup, so if two consecutive backups have the same
start, then the start of the second one will precede the end of the
first one.
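To put numbers on that using the LSNs from the quoted error: both
backups arrange to begin recovery from the shared restartpoint at
0/A000028, and the earlier backup's manifest records its WAL range as
ending at 0/A0000F8. Since the second backup's start (0/A000028)
precedes the first backup's end (0/A0000F8), the start of the second
backup falls before the end of the first, which is exactly the
condition the error message is complaining about.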
One thing that's interesting to note here is that there is no point in
performing an incremental backup under these circumstances. You would
accrue no advantage over just letting replay continue further from the
full backup. The whole point of an incremental backup is that it lets
you "fast forward" your older backup -- you could have just replayed
all the WAL from the older backup until you got to the start LSN of
the newer backup, but reconstructing a backup that can start replay
from the newer LSN directly is, hopefully, quicker than replaying all
of that WAL. But in this scenario, you're starting from the same
checkpoint no matter what -- the amount of WAL replay required to
reach any given LSN will be unchanged. So storing an incremental
backup would be strictly a loss.
Another interesting point to consider is that you could also get this
complaint by doing something like taking the full backup from the
primary, and then trying to take an incremental backup from a standby,
maybe even a time-delayed standby that's far behind the primary. In
that case, you would really be trying to take an incremental backup
before you actually took the full backup, as far as LSN time goes.
I'm not quite sure what to do about any of this. I think the error is
correct and accurate, but understanding what it means and why it's
happening and what to do about it is probably going to be difficult
for people. Possibly we should have documentation that talks you
through all of this. Or possibly there are ways to elaborate on the
error message itself. But I'm a little skeptical about the latter
approach because it's all so complicated. I don't know that we can
summarize it in a sentence or two.
6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having a
more recent pg_basebackup (e.g. from pg18 one day) running against
pg17 in the scope of this incremental backup possibility. Is
it going to be safe? (currently there seem to be no safeguards
against such use) Or should those things (core, pg_basebackup)
be running in version lockstep?
I think it should be safe, actually. pg_basebackup has no reason to
care about WAL format changes across versions. It doesn't even care
about the format of the WAL summaries, which it never sees, but only
needs the server to have. If we change the format of the incremental
files that are included in the backup, then we will need
backward-compatibility code, or we can disallow cross-version
operations. I don't currently foresee a need to do that, but you never
know. It's manageable in any event.
But note that I also didn't (and can't, without a lot of ugliness)
make pg_combinebackup version-independent. So you could think of
taking incremental backups with a different version of pg_basebackup,
but if you want to restore you're going to need a matching version of
pg_combinebackup.
--
Robert Haas
EDB: http://www.enterprisedb.com
New patch set.
0001: Rename JsonManifestParseContext callbacks, per feedback from
Álvaro. Not logically related to the rest of this, except by code
proximity. Separately committable, if nobody objects.
0002: Rename record_manifest_details_for_{file,wal_range}, per
feedback from Álvaro that the names were too generic. Separately
committable, if nobody objects.
0003: Move parse_manifest.c to src/common. No significant changes
since the previous version.
0004: Add a new WAL summarizer process. No significant changes since
the previous version.
0005: Incremental backup itself. Changes:
- Remove UPLOAD_MANIFEST replication command and instead add
INCREMENTAL_WAL_RANGE replication command.
- As a consequence, load_manifest.c, which was included in the previous
patch sets, now moves to src/fe_utils and has some adjustments.
- Actually document the new replication command which I overlooked previously.
- Error out promptly if an incremental backup is attempted with
summarize_wal = off.
- Fix test in copy_file(). We should be willing to use the fast-path
if a new checksum is *not* required, but the sense of the test was
inverted in previous versions (see the sketch after this patch list).
- Fix some handling of the missing-manifest case in pg_combinebackup.
- Fix terminology in a help message.
0006: Add pg_walsummary tool. No significant changes since the previous version.
0007: Test patch, not for commit.
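Since the copy_file() fix under 0005 may be easier to see in code than
in prose, here is a minimal, hypothetical sketch of the corrected
decision (the flag and helper names are invented for illustration; this
is not the actual pg_combinebackup code):

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for the real copy paths */
static void
copy_file_fast(const char *src, const char *dst)
{
	printf("fast copy (no checksum) %s -> %s\n", src, dst);
}

static void
copy_file_with_checksum(const char *src, const char *dst)
{
	printf("slow copy, feeding data into checksum: %s -> %s\n", src, dst);
}

static void
copy_file(const char *src, const char *dst, bool new_checksum_required)
{
	/*
	 * Earlier versions effectively tested "if (new_checksum_required)"
	 * here, taking the fast path exactly when it was forbidden. The fast
	 * path never feeds data through the checksum, so it is only usable
	 * when no new checksum computation is needed.
	 */
	if (!new_checksum_required)
		copy_file_fast(src, dst);
	else
		copy_file_with_checksum(src, dst);
}

int
main(void)
{
	copy_file("base/1/1234", "out/1234", true);
	copy_file("base/1/5678", "out/5678", false);
	return 0;
}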
As far as I know, the main commit-blockers here are (1) the timeout
when waiting for WAL summarization is still hard-coded to 60 seconds
and (2) the ubsan issue that Thomas hunted down, which would cause at
least the entire CF environment and maybe some portion of the BF to
turn red if this were committed. That issue is in xlogreader rather
than in this patch set, at least in part, but it still needs fixing
before this goes ahead. I also suspect that the slightly more
significant refactoring in this version may turn up a few new bugs in
the CF environment. I think that once the aforementioned items are
sorted out, this could be committed through 0005, and 0001 and 0002
could be committed sooner. 0006 should have some tests written before
it gets committed, but it doesn't necessarily have to be committed at
the exact same moment as everything else, and writing tests isn't that
hard, either.
Other loose ends that would be nice to tidy up at some point:
- Incremental JSON parsing so we can handle huge manifests.
- More documentation as proposed by Álvaro but I'm failing to find the
details of his proposal right now.
- More introspection facilities, maybe, or possibly rip some
stuff out of WalSummarizerCtl if we don't want it. This one might be a
higher priority to address before initial commit, but it's probably
not absolutely critical, either.
I'm not quite sure how aggressively to press forward with getting
stuff committed. I'd certainly rather debug as much as I can locally
and via cfbot before turning the buildfarm pretty colors, but I think
it generally works out better when larger features get pushed earlier
in the cycle rather than in the mad frenzy right before feature
freeze, so I'm not inclined to be too patient, either.
...Robert
Attachments:
v12-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
From f7c178b722381aab138ae422341897c6de8ef522 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:15:14 -0500
Subject: [PATCH v12 2/7] Rename pg_verifybackup's JsonManifestParseContext
callback functions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The old names were too generic, and would have applied to any binary
that made use of JsonManifestParseContext. Rename to make the names
specific to pg_verifybackup, since there are plans afoot to reuse
this infrastructure.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/pg_verifybackup.c | 36 +++++++++++------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 8526eb9bbf..d921d0f003 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -119,15 +119,15 @@ static void parse_manifest_file(char *manifest_path,
manifest_files_hash **ht_p,
manifest_wal_range **first_wal_range_p);
-static void record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length,
- uint8 *checksum_payload);
-static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+static void verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
static void report_manifest_error(JsonManifestParseContext *context,
const char *fmt,...)
pg_attribute_printf(2, 3) pg_attribute_noreturn();
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.per_file_cb = record_manifest_details_for_file;
- context.per_wal_range_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = verifybackup_per_file_cb;
+ context.per_wal_range_cb = verifybackup_per_wal_range_cb;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
@@ -475,10 +475,10 @@ report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
* Record details extracted from the backup manifest for one file.
*/
static void
-record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload)
+verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
{
parser_context *pcxt = context->private_data;
manifest_files_hash *ht = pcxt->ht;
@@ -504,9 +504,9 @@ record_manifest_details_for_file(JsonManifestParseContext *context,
* Record details extracted from the backup manifest for one WAL range.
*/
static void
-record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
parser_context *pcxt = context->private_data;
manifest_wal_range *range;
--
2.39.3 (Apple Git-145)
v12-0001-Rename-JsonManifestParseContext-callbacks.patch
From f059dd592e239f6e4d26c7d0e06bab04f5522977 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:10:01 -0500
Subject: [PATCH v12 1/7] Rename JsonManifestParseContext callbacks.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
There is currently a worldwide oversupply of underscores, so use
some of them here as word separators. In the event of a later
underscore shortage, these can be removed again, and another of
PostgreSQL's innumerable methods of marking word boundaries can
be substituted.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/parse_manifest.c | 8 ++++----
src/bin/pg_verifybackup/parse_manifest.h | 18 +++++++++---------
src/bin/pg_verifybackup/pg_verifybackup.c | 4 ++--
src/tools/pgindent/typedefs.list | 4 ++--
4 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/bin/pg_verifybackup/parse_manifest.c
index bf0227c668..850adf90a8 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/bin/pg_verifybackup/parse_manifest.c
@@ -112,7 +112,7 @@ static bool parse_xlogrecptr(XLogRecPtr *result, char *input);
*
* Caller should set up the parsing context and then invoke this function.
* For each file whose information is extracted from the manifest,
- * context->perfile_cb is invoked. In case of trouble, context->error_cb is
+ * context->per_file_cb is invoked. In case of trouble, context->error_cb is
* invoked and is expected not to return.
*/
void
@@ -545,8 +545,8 @@ json_manifest_finalize_file(JsonManifestParseState *parse)
}
/* Invoke the callback with the details we've gathered. */
- context->perfile_cb(context, parse->pathname, size,
- checksum_type, checksum_length, checksum_payload);
+ context->per_file_cb(context, parse->pathname, size,
+ checksum_type, checksum_length, checksum_payload);
/* Free memory we no longer need. */
if (parse->size != NULL)
@@ -602,7 +602,7 @@ json_manifest_finalize_wal_range(JsonManifestParseState *parse)
"could not parse end LSN");
/* Invoke the callback with the details we've gathered. */
- context->perwalrange_cb(context, tli, start_lsn, end_lsn);
+ context->per_wal_range_cb(context, tli, start_lsn, end_lsn);
/* Free memory we no longer need. */
if (parse->timeline != NULL)
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/bin/pg_verifybackup/parse_manifest.h
index 7387a917a2..001b9a6a11 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/bin/pg_verifybackup/parse_manifest.h
@@ -21,13 +21,13 @@
struct JsonManifestParseContext;
typedef struct JsonManifestParseContext JsonManifestParseContext;
-typedef void (*json_manifest_perfile_callback) (JsonManifestParseContext *,
- char *pathname,
- size_t size, pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload);
-typedef void (*json_manifest_perwalrange_callback) (JsonManifestParseContext *,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+typedef void (*json_manifest_per_file_callback) (JsonManifestParseContext *,
+ char *pathname,
+ size_t size, pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload);
+typedef void (*json_manifest_per_wal_range_callback) (JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
const char *fmt,...) pg_attribute_printf(2, 3)
pg_attribute_noreturn();
@@ -35,8 +35,8 @@ typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
struct JsonManifestParseContext
{
void *private_data;
- json_manifest_perfile_callback perfile_cb;
- json_manifest_perwalrange_callback perwalrange_cb;
+ json_manifest_per_file_callback per_file_cb;
+ json_manifest_per_wal_range_callback per_wal_range_cb;
json_manifest_error_callback error_cb;
};
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..8526eb9bbf 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.perfile_cb = record_manifest_details_for_file;
- context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = record_manifest_details_for_file;
+ context.per_wal_range_cb = record_manifest_details_for_wal_range;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d659adbfd6..38a86575e1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3445,8 +3445,8 @@ jmp_buf
join_search_hook_type
json_aelem_action
json_manifest_error_callback
-json_manifest_perfile_callback
-json_manifest_perwalrange_callback
+json_manifest_per_file_callback
+json_manifest_per_wal_range_callback
json_ofield_action
json_scalar_action
json_struct_action
--
2.39.3 (Apple Git-145)
v12-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
From 9c31fb418c22bb02be5bc0e3b76fb6dca58cf252 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v12 3/7] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.39.3 (Apple Git-145)
v12-0004-Add-a-new-WAL-summarizer-process.patch
From ab3002d1908634f24775af799c0e6c822d598441 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v12 4/7] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1383 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3726 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 94d1eb2b81..4fc5c64150 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4150,6 +4150,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6526bd4f43..72c7c86707 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3587,6 +3588,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3867,8 +3905,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3911,6 +3949,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5235,9 +5293,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6954,6 +7012,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7628,6 +7705,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7a5cd06c5c..1d52a2db7c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -931,6 +935,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1833,6 +1840,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2667,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3022,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3141,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3550,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3706,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3734,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3832,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4054,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5403,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5543,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..a083647c42
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1383 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which the
+ * next summary file will start. Normally, these are the TLI and LSN at
+ * which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60; /* 10 days, in minutes */
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
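(Aside, not part of the patch: to make the fallback initialization above
concrete, XLogSegNoOffsetToRecPtr with a zero offset is just a multiply, so
the derived starting position is the byte offset at which the oldest
surviving segment begins. Something like:

    /* Illustration only: where segment 'oldest_segno' begins. */
    static XLogRecPtr
    oldest_segment_start_lsn(XLogSegNo oldest_segno, int segment_size)
    {
        /* e.g. segment 5 with 16MB segments yields LSN 0/5000000 */
        return (XLogRecPtr) oldest_segno * segment_size;
    }

That position can fall in the middle of a record that spilled over from the
previous segment, which is exactly why lsn_is_exact is false in this path.)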
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
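(Aside, not part of the patch: here's roughly how I imagine the incremental
backup side consuming this function; the 60-second timeout is invented for
the example.

    /* Illustration only: fail if summarization can't catch up in time. */
    XLogRecPtr  reached;

    reached = WaitForWalSummarization(backup_start_lsn, 60000);
    if (reached < backup_start_lsn)
        ereport(ERROR,
                (errmsg("timed out waiting for WAL summarization"),
                 errdetail("Summarization has reached %X/%X, but %X/%X is needed.",
                           LSN_FORMAT_ARGS(reached),
                           LSN_FORMAT_ARGS(backup_start_lsn))));

This is also why MAX_SLEEP_QUANTA is capped well below any plausible backup
timeout, per the comment near the top of this file.)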
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
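(Aside, not part of the patch: spelling out the naming scheme implemented
just above, each summary file name is 40 hex digits - 8 for the TLI and 16
each for the start and end LSNs. The real name handling lives in
walsummary.c, but decoding a name amounts to this sketch:

    /* Illustration only: recover TLI and LSN range from a summary name. */
    static bool
    parse_summary_name(const char *name, TimeLineID *tli,
                       XLogRecPtr *start_lsn, XLogRecPtr *end_lsn)
    {
        uint32      tmp_tli,
                    start_hi, start_lo, end_hi, end_lo;

        if (sscanf(name, "%08X%08X%08X%08X%08X.summary",
                   &tmp_tli, &start_hi, &start_lo, &end_hi, &end_lo) != 5)
            return false;
        *tli = tmp_tli;
        *start_lsn = ((XLogRecPtr) start_hi << 32) | start_lo;
        *end_lsn = ((XLogRecPtr) end_hi << 32) | end_lo;
        return true;
    }

Durably renaming the completed file into place means a crash can leave
behind at most temp.summary, never a torn final file.)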
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to reading from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
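(Aside, not part of the patch: restated as a pure function, the backoff rule
above is the following; on an idle system the wait doubles from 200ms up to
the 30-second cap, and a burst of page reads walks it back down.

    /* Illustration only: next sleep_quanta given recent page reads. */
    static long
    next_sleep_quanta(long sleep_quanta, long pages_read)
    {
        if (pages_read == 0)
            return Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
        if (pages_read > 1)
            return pages_read > sleep_quanta - 1 ? 1 : sleep_quanta - pages_read;
        return sleep_quanta;    /* exactly one page read: no change */
    }

Reading exactly one page is deliberately treated as noise rather than as a
trend.)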
+/*
+ * Remove old WAL summary files, if the summarized WAL has itself already
+ * been removed and the summaries are older than wal_summary_keep_time.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the summarized WAL no longer exists on disk, we can remove the
+ * summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cfc5afaa6f..ef2a3a2bfd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 6474e35ec0..405c422db7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3201,6 +3214,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf9f283cfe..b2809c711a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
+
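(Aside, not part of the patch: a worked example of these constants. Block
200005 falls in chunk 3 (200005 / 65536) at offset 3397 (200005 % 65536);
the array form stores that 2-byte offset directly, while a chunk that has
grown to MAX_ENTRIES_PER_CHUNK uint16s - 4096 of them, i.e. 65536 bits -
treats the same offset as a bit address:

    /* Illustration only: mark a block in a chunk that is in bitmap form. */
    static void
    bitmap_mark_block(BlockRefTableChunk chunk, BlockNumber blknum)
    {
        uint16      chunkoffset = blknum % BLOCKS_PER_CHUNK;

        chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
            (uint16) (1 << (chunkoffset % BLOCKS_PER_ENTRY));
    }

Either way, a fully populated chunk never needs more than 8kB, and sparse
chunks need far less.)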
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
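+
+/*
+ * A quick illustrative sketch (not part of the API documentation): after
+ *
+ *		BlockRefTable *brtab = CreateEmptyBlockRefTable();
+ *
+ *		BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 0);
+ *		BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 7);
+ *		BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 4);
+ *
+ * the table remembers block 0 as modified, has forgotten block 7 (it is
+ * greater than or equal to the limit block), and records a limit block of
+ * 4 for the fork, where 'rlocator' is the RelFileLocator of whatever
+ * relation is of interest.
+ */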
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ /* restrict stop_offset only if stop_blkno falls inside this chunk */
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
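+
+/*
+ * Note that unlike BlockRefTableReaderGetBlocks below, this function keeps
+ * no scan state, so a caller draining an entry in batches must advance
+ * start_blkno itself. A hypothetical sketch ('entry' as obtained from
+ * BlockRefTableGetEntry):
+ *
+ *		BlockNumber blocks[16];
+ *		BlockNumber start = 0;
+ *		int			n;
+ *
+ *		while ((n = BlockRefTableEntryGetBlocks(entry, start,
+ *												InvalidBlockNumber,
+ *												blocks, 16)) > 0)
+ *		{
+ *			-- process blocks[0] .. blocks[n - 1] here --
+ *			start = blocks[n - 1] + 1;
+ *		}
+ */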
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
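+
+/*
+ * For reference, the file layout produced above is:
+ *
+ *		uint32 magic number (BLOCKREFTABLE_MAGIC)
+ *		for each relation fork, in sorted order:
+ *			BlockRefTableSerializedEntry (with trailing empty chunks trimmed)
+ *			uint16 chunk_usage[nchunks]
+ *			per-chunk data, chunk_usage[i] uint16s each (empty chunks skipped)
+ *		all-zeroes BlockRefTableSerializedEntry as a sentinel
+ *		uint32 CRC-32C of everything above
+ */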
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
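+
+/*
+ * A minimal read-side sketch (illustrative only; 'read_cb' and 'error_cb'
+ * stand for caller-provided callbacks meeting the contracts described in
+ * blkreftable.h, and 'io' for whatever state read_cb needs):
+ *
+ *		BlockRefTableReader *reader;
+ *		RelFileLocator rlocator;
+ *		ForkNumber forknum;
+ *		BlockNumber limit_block;
+ *		BlockNumber blocks[16];
+ *		unsigned n;
+ *
+ *		reader = CreateBlockRefTableReader(read_cb, &io, filename,
+ *										   error_cb, NULL);
+ *		while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *											   &limit_block))
+ *		{
+ *			while ((n = BlockRefTableReaderGetBlocks(reader, blocks, 16)) > 0)
+ *				-- process n block numbers --
+ *		}
+ *		DestroyBlockRefTableReader(reader);
+ *
+ * Unlike BlockRefTableEntryGetBlocks, this interface is stateful, so each
+ * call simply returns the next batch of block numbers.
+ */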
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
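+
+/*
+ * Putting the writer pieces together, an incremental write (which avoids
+ * holding a whole BlockRefTable in memory) might look roughly like this
+ * sketch, assuming the caller can produce relation forks in sorted order:
+ *
+ *		writer = CreateBlockRefTableWriter(write_cb, &io);
+ *		-- for each relation fork, in sorted order: --
+ *		entry = CreateBlockRefTableEntry(rlocator, forknum);
+ *		-- BlockRefTableEntryMarkBlockModified() and/or
+ *		   BlockRefTableEntrySetLimitBlock() as needed --
+ *		BlockRefTableWriteEntry(writer, entry);
+ *		BlockRefTableFreeEntry(entry);
+ *		-- end loop --
+ *		DestroyBlockRefTableWriter(writer);
+ */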
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
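+
+/*
+ * To summarize the representation used above: each chunk starts life as a
+ * small array of 2-byte block offsets, doubling as needed; once it would
+ * hold MAX_ENTRIES_PER_CHUNK offsets, it is converted to a bitmap with one
+ * bit per block in the chunk. Assuming the constants are defined so that
+ * MAX_ENTRIES_PER_CHUNK offsets occupy the same number of bytes as a full
+ * bitmap, a chunk never grows beyond that size no matter how many of its
+ * blocks are marked, while sparsely-modified chunks stay small.
+ */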
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..79c8f86d89 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12100,4 +12100,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
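+
+/*
+ * An illustrative read callback over a plain file descriptor (a sketch,
+ * not part of this header's API; 'callback_arg' is assumed to point to an
+ * int file descriptor, and error handling is frontend-style):
+ *
+ *		static int
+ *		my_read_callback(void *callback_arg, void *data, int length)
+ *		{
+ *			int		fd = *(int *) callback_arg;
+ *			int		rc = read(fd, data, length);
+ *
+ *			if (rc < 0)
+ *				pg_fatal("could not read file: %m");
+ *			return rc;
+ *		}
+ */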
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..4a6792e5f9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4b25961249..e87fd25d64 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 38a86575e1..4d99b4b3f1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4007,3 +4007,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.39.3 (Apple Git-145)
Attachment: v12-0006-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From f88da16681f8defc813f844442f004e425a762cf Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v12 6/7] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 477 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, database OID, relation
+ filenode, and relation fork. For each relation fork, it stores the list
+ of blocks that were modified by WAL within the range summarized in the
+ file. It can also store a "limit block," which is 0 if the relation fork
+ was created or dropped within the relevant WAL range, and otherwise the
+ shortest length in blocks to which the relation fork was truncated. If
+ the relation fork was not created, dropped, or truncated within the
+ relevant WAL range, the limit block is undefined or infinite and will not
+ be printed by this tool.
+ </para>
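+
+ <para>
+  For example (with purely illustrative OIDs and block numbers), output
+  lines look like this:
+<screen>
+TS 1663, DB 5, REL 16384, FORK main: limit 0
+TS 1663, DB 5, REL 16384, FORK main: blocks 0..16
+</screen>
+ </para>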
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks, which keeps
+ the output brief: a relation where all blocks from 0 through 999 were
+ modified produces just one line of output rather than 1000 separate
+ lines. This option instead requests a separate line of output for
+ every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index a083647c42..7966755f22 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -338,7 +338,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("no input files specified");
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e7d8cf5195..d2114ca161 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4029,3 +4029,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
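For reference, here is what a hypothetical run of the new tool might look like (file name and block numbers invented for illustration; the output format follows the printf calls in dump_one_relation above):

$ pg_walsummary 0000000100000000010000D80000000001028758.summary
TS 1663, DB 16384, REL 16385, FORK main: limit 0
TS 1663, DB 16384, REL 16385, FORK main: blocks 0..16
TS 1663, DB 16384, REL 16386, FORK main: block 7

With --individual, each block in a range would instead be listed on its own line.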
v12-0005-Add-support-for-incremental-backup.patch
From be158830e2454259534973d7bd2a76268ba82520 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v12 5/7] Add support for incremental backup.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To take an incremental backup, you use the new replication command
INCREMENTAL_WAL_RANGE to tell the server which WAL range(s) the manifest
for the prior backup covers. This prior backup could either be a full
backup or another incremental backup. You then use BASE_BACKUP with the
INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar, Jakub Wartak, and
Álvaro Herrera.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/protocol.sgml | 28 +
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 319 +++-
src/backend/backup/basebackup_incremental.c | 706 +++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 21 +
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 55 +-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 87 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 51 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/meson.build | 37 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1312 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/fe_utils/Makefile | 1 +
src/fe_utils/load_manifest.c | 261 ++++
src/fe_utils/meson.build | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 53 +
src/include/fe_utils/load_manifest.h | 69 +
src/include/nodes/replnodes.h | 12 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 11 +
51 files changed, 5451 insertions(+), 51 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/fe_utils/load_manifest.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/fe_utils/load_manifest.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and simply take full backups, which are simpler
+ to manage. For a large database all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
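As a concrete sketch of the incremental restore steps described above (paths invented): restore each backup into its own directory, combine them into the target data directory, then set up recovery as usual:

pg_combinebackup /restore/full /restore/incr1 /restore/incr2 -o /var/lib/postgresql/data
touch /var/lib/postgresql/data/recovery.signal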
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4fc5c64150..13212ba5d9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4153,13 +4153,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index af3f016f74..71ce5a9576 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2599,6 +2599,23 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</listitem>
</varlistentry>
+ <varlistentry id="protocol-replication-incremental-wal-range">
+ <term>
+ <literal>INCREMENTAL_WAL_RANGE</literal> <replaceable class="parameter">tli</replaceable> <replaceable class="parameter">start_lsn</replaceable> <replaceable class="parameter">end_lsn</replaceable>
+ <indexterm><primary>INCREMENTAL_WAL_RANGE</primary></indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies a range of WAL records for a forthcoming incremental backup
+ request. The incremental backup will need to include all changes covered
+ by write-ahead log records on the specified timeline between the
+ specified start LSN and the specified end LSN. This command can be
+ issued multiple times before running <literal>BASE_BACKUP</literal>
+ with the <literal>INCREMENTAL</literal> option.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP">
<term><literal>BASE_BACKUP</literal> [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
@@ -2838,6 +2855,17 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>INCREMENTAL</literal></term>
+ <listitem>
+ <para>
+ Requests an incremental backup. The
+ <literal>INCREMENTAL_WAL_RANGE</literal> command must be executed
+ at least once before running a base backup with this option.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_meanifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
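Given the format strings above, the two new backup_label lines for an incremental backup would look like this (values invented):

INCREMENTAL FROM LSN: 0/2000028
INCREMENTAL FROM TLI: 1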
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
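So trying to start a server directly on an unreconstructed incremental backup now fails up front, with output along these lines:

FATAL:  this is an incremental backup, not a data directory
HINT:  Use pg_combinebackup to reconstruct a valid data directory.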
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..b4c5b60eeb 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -33,6 +35,7 @@
#include "pgtar.h"
#include "port.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
#include "storage/bufpage.h"
@@ -64,6 +67,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +80,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +113,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +232,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +283,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +306,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +357,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +365,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +627,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +703,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +782,20 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ if (opt->incremental && !summarize_wal)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +988,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1012,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied incremental backup information, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply such
+ * information, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must use INCREMENTAL_WAL_RANGE before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1057,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1135,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1169,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1189,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1198,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1240,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1392,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1467,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1542,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1558,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1571,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1581,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1614,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1901,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
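Piecing together the header writes in sendFile with the per-block read loop, the byte layout of an INCREMENTAL.* file comes out as follows (a sketch inferred from the code above, not a declared struct, and not valid C as written; the field names mirror the local variables):

unsigned    magic;                   /* INCREMENTAL_MAGIC */
unsigned    num_incremental_blocks;  /* number of block images that follow */
unsigned    truncation_block_length; /* used to handle truncation at reconstruction time */
BlockNumber blocks[num_incremental_blocks]; /* segment-relative block numbers, one per image */
/* ...followed by num_incremental_blocks page images of BLCKSZ bytes each */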
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..e3779bae47
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,706 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Add a WAL range to IncrementalBackupInfo.
+ *
+ * This means that we'll need WAL summary files that cover the corresponding
+ * WAL range, and will include blocks modified by that WAL range in the
+ * incremental backup.
+ */
+void
+AddIncrementalWalRange(IncrementalBackupInfo *ib, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ backup_wal_range *range;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ range = palloc(sizeof(backup_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AddIncrementalWalRange should already have
+ * been called once per WAL range.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
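+ *
+ * For example (hypothetical OIDs), segment 3 of relation 16385 in database
+ * 16384 would be sent as "base/16384/INCREMENTAL.16385.3" rather than
+ * "base/16384/16385.3", and segment 0 as "base/16384/INCREMENTAL.16385".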
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should have at least RELSEG_SIZE entries.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
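+ *
+ * A caller might use this roughly as follows (an illustrative sketch only;
+ * the variable names are not part of the API):
+ *
+ *		BlockNumber blocks[RELSEG_SIZE];
+ *		unsigned	nblocks;
+ *		unsigned	truncation_block_length;
+ *
+ *		if (GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
+ *								forknum, segno, size, &nblocks, blocks,
+ *								&truncation_block_length) == BACK_UP_FILE_FULLY)
+ *			... send the whole file ...
+ *		else
+ *			... send an incremental file with nblocks blocks ...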
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /*
+	 * dboid could be InvalidOid for a shared relation, but spcoid and
+	 * relfilenumber should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+	 * The free-space map fork is not properly WAL-logged, so we need to
+	 * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
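+	 *
+	 * For example, assuming the default BLCKSZ of 8192 and RELSEG_SIZE of
+	 * 131072 (a full 1GB segment), a segment with more than 117964 modified
+	 * blocks would be sent in full.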
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+	 * Looks like we can send an incremental file, so sort the absolute block
+	 * numbers and then transpose them into relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
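+	 *
+	 * For example, assuming the default BLCKSZ of 8192, an incremental file
+	 * containing two blocks occupies 3 * 4 + 2 * (4 + 8192) = 16404 bytes.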
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..6e0dd69c03 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,13 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_INCREMENTAL_WAL_RANGE
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
read_replication_slot timeline_history show
+ incremental_wal_range
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +116,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | incremental_wal_range
;
/*
@@ -307,6 +310,23 @@ timeline_history:
}
;
+incremental_wal_range:
+ K_INCREMENTAL_WAL_RANGE UCONST RECPTR RECPTR
+ {
+ IncrementalWalRangeCmd *cmd = makeNode(IncrementalWalRangeCmd);
+
+ if ($2 <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid timeline %u", $2)));
+
+ cmd->tli = $2;
+ cmd->start_lsn = $3;
+ cmd->end_lsn = $4;
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +431,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_INCREMENTAL_WAL_RANGE { $$ = "incremental_wal_range"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..8927587cfb 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+INCREMENTAL_WAL_RANGE { return K_INCREMENTAL_WAL_RANGE; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_INCREMENTAL_WAL_RANGE:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..7d9c63a925 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,16 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the INCREMENTAL_WAL_RANGE command is used to prepare for an incremental
+ * backup request, the information thus provided will be stored in
+ * session_ib_info, and session_ib_mcxt will point to the memory context that
+ * contains that object and all of its subordinate data. Otherwise, both
+ * values will be NULL.
+ */
+static IncrementalBackupInfo *session_ib_info = NULL;
+static MemoryContext session_ib_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +244,7 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void IncrementalWalRange(IncrementalWalRangeCmd *cmd);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -326,6 +338,13 @@ WalSndErrorCleanup(void)
ReplicationSlotCleanup();
+ if (session_ib_mcxt != NULL)
+ {
+ MemoryContextDelete(session_ib_mcxt);
+ session_ib_info = NULL;
+ session_ib_mcxt = NULL;
+ }
+
replication_active = false;
/*
@@ -660,6 +679,26 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle INCREMENTAL_WAL_RANGE command.
+ */
+static void
+IncrementalWalRange(IncrementalWalRangeCmd *cmd)
+{
+ /* Create a new IncrementalBackupInfo if we don't have one yet. */
+ if (session_ib_mcxt == NULL)
+ {
+ session_ib_mcxt = AllocSetContextCreate(CacheMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ session_ib_info = CreateIncrementalBackupInfo(session_ib_mcxt);
+ }
+
+ /* Add the information to our IncrementalBackupInfo. */
+ AddIncrementalWalRange(session_ib_info, cmd->tli, cmd->start_lsn,
+ cmd->end_lsn);
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1840,13 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, session_ib_info);
+ if (session_ib_mcxt != NULL)
+ {
+ MemoryContextDelete(session_ib_mcxt);
+ session_ib_info = NULL;
+ session_ib_mcxt = NULL;
+ }
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +1908,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_IncrementalWalRangeCmd:
+ cmdtag = "INCREMENTAL_WAL_RANGE";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ IncrementalWalRange((IncrementalWalRangeCmd *) cmd_node);
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..cb13c1f887 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -32,6 +32,7 @@
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/logging.h"
+#include "fe_utils/load_manifest.h"
#include "fe_utils/option_utils.h"
#include "fe_utils/recovery_gen.h"
#include "getopt_long.h"
@@ -101,6 +102,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +223,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +397,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +697,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+	 * For newer server versions, likewise create pg_wal/summaries.
+	 */
+	if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+		if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1754,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1822,50 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+	 * If the user wants an incremental backup, we must send the server the
+	 * WAL ranges from the manifest of the previous backup upon which it is
+	 * to be based.
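+	 *
+	 * Each range becomes one INCREMENTAL_WAL_RANGE command; with
+	 * hypothetical values, such a command might look like this:
+	 *
+	 *		INCREMENTAL_WAL_RANGE 1 0/2000028 0/3000100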
+ */
+ if (incremental_manifest != NULL)
+ {
+ manifest_data *imdata;
+ manifest_wal_range *wal_range;
+
+ /* Reject if server is too old. */
+ if (serverVersion < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ pg_fatal("server does not support incremental backup");
+
+ /* Extract required WAL ranges from the manifest. */
+ imdata = load_backup_manifest(incremental_manifest,
+ LBM_WAL_RANGES);
+
+ /* Send WAL ranges from manifest to server. */
+ for (wal_range = imdata->first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ {
+ char *query;
+
+ query = psprintf("INCREMENTAL_WAL_RANGE %u %X/%X %X/%X",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+ res = PQexec(conn, query);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not execute query \"%s\": %s",
+ query, PQerrorMessage(conn));
+ else
+ pg_fatal("could not execute query \"%s\": unexpected status %s",
+ query, PQresStatus(PQresultStatus(res)));
+ }
+ }
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1972,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2328,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2366,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2391,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2426,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2842,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..cd8bbdd275
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,51 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
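+ *
+ * For illustration, an input file might contain lines like these
+ * (hypothetical values):
+ *
+ *		START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
+ *		START TIMELINE: 1
+ *		INCREMENTAL FROM LSN: 0/1000028
+ *		INCREMENTAL FROM TLI: 1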
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, a pointer to the byte
+ * following the match is stored into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. Returns true on
+ * success, false otherwise. On success, stores the result into *lsn and sets
+ * *c to the first character that is not part of the LSN.
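+ *
+ * For example, the string "0/2000028" parses to the LSN 0x0000000002000028.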
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. Returns true on
+ * success, false otherwise. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..40a55e3087
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..723811b5ba
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,37 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..4a753f4a3e
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1312 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+	while ((c = getopt_long(argc, argv, "do:nNT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, prior_backup_dirs);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ {
+ mwriter = create_manifest_writer(opt.output);
+
+ /*
+ * Verify that we have a backup manifest for the final backup; else we
+ * won't have the WAL ranges for the resulting manifest.
+ */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ }
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
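+ *
+ * For example, a hypothetical argument "/mnt/old\=dir=/mnt/new" yields
+ * old_dir "/mnt/old=dir" and new_dir "/mnt/new": the backslash turns the
+ * first "=" into a literal character rather than a separator.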
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
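+ *
+ * The property verified here: for each backup after the first, the
+ * previous-backup TLI and LSN recorded in its backup_label must match
+ * the start TLI and LSN of the backup that precedes it on the command
+ * line.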
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+static manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+ const int flags = LBM_FILES | LBM_WAL_RANGES | LBM_MISSING_OK;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ {
+ char *pathname;
+
+ pathname = psprintf("%s/backup_manifest", backup_directories[i]);
+ result[i] = load_backup_manifest(pathname, flags);
+ pfree(pathname);
+ }
+
+ return result;
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
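+ *
+ * For example, files under relative path "base/5" of the main data
+ * directory are looked up in the manifest as "base/5/<name>", while
+ * files at the top level of a hypothetical tablespace 16384 are looked
+ * up as "pg_tblspc/16384/<name>".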
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strncpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
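+ *
+ * Incremental files carry an "INCREMENTAL." name prefix, so a
+ * hypothetical "INCREMENTAL.16384" in the input directory becomes
+ * "16384" in the output directory and in the manifest.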
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format (e.g., if PG_VERSION contains "14\n" this function
+ * will return 140000).
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g., 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old", filename);
+ pg_fatal("%s: could not parse version number", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
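+ *
+ * For instance, if block 3 of the output should come from byte offset
+ * 24576 of a full file in some prior backup, then sourcemap[3] points to
+ * that file's rfile and offsetmap[3] is 24576 (3 * BLCKSZ, assuming the
+ * default 8kB block size).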
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without
+ * subsequently doing anything to those blocks that would have
+ * generated WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
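+ *
+ * For example, with a truncation_block_length of 10 and a stored block 12,
+ * the reconstructed file is 13 blocks long; blocks 10 and 11, not being
+ * sourced from anywhere, end up zero-filled.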
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
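+ *
+ * As the reads below imply, an incremental file starts with a magic number,
+ * a block count, and a truncation block length, followed by one BlockNumber
+ * per stored block; the stored blocks themselves come after the header.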
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ int rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
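+ /*
+ * Each plan entry below has the form "START[-END]:SOURCE@OFFSET" for
+ * blocks read from a source file, or "START[-END]:zero" for blocks
+ * that will be zero-filled; the offset shown is that of the last
+ * block in the range.
+ */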
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source file. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..5c0fb8ceef
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "fe_utils/load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+SELECT string_agg(a::text, ':'), string_agg(b, ':') FROM mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant', 'found expected rows');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that the no-checksum manifest does not mention a checksum algorithm.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..945e0d0dde
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "fe_utils/load_manifest.h"
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/fe_utils/Makefile b/src/fe_utils/Makefile
index 8accd5906d..b8d5428380 100644
--- a/src/fe_utils/Makefile
+++ b/src/fe_utils/Makefile
@@ -24,6 +24,7 @@ OBJS = \
cancel.o \
conditional.o \
connect_utils.o \
+ load_manifest.o \
mbprint.o \
option_utils.o \
parallel_slot.o \
diff --git a/src/fe_utils/load_manifest.c b/src/fe_utils/load_manifest.c
new file mode 100644
index 0000000000..2540092e1d
--- /dev/null
+++ b/src/fe_utils/load_manifest.c
@@ -0,0 +1,261 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/fe_utils/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "fe_utils/load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void loadmanifest_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void loadmanifest_per_file_noop_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void loadmanifest_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void loadmanifest_per_wal_range_noop_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *pathname, int flags)
+{
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT && (flags & LBM_MISSING_OK) != 0)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ if ((flags & LBM_FILES) != 0)
+ context.per_file_cb = loadmanifest_per_file_cb;
+ else
+ context.per_file_cb = loadmanifest_per_file_noop_cb;
+ if ((flags & LBM_WAL_RANGES) != 0)
+ context.per_wal_range_cb = loadmanifest_per_wal_range_cb;
+ else
+ context.per_wal_range_cb = loadmanifest_per_wal_range_noop_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+loadmanifest_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Do nothing at all for each file in the backup manifest.
+ */
+static void
+loadmanifest_per_file_noop_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ /* do nothing */
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+loadmanifest_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Do nothing at all for each WAL range in the backup manifest.
+ */
+static void
+loadmanifest_per_wal_range_noop_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ /* do nothing */
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/fe_utils/meson.build b/src/fe_utils/meson.build
index ea96e862ad..350bb6b55f 100644
--- a/src/fe_utils/meson.build
+++ b/src/fe_utils/meson.build
@@ -5,6 +5,7 @@ fe_utils_sources = files(
'cancel.c',
'conditional.c',
'connect_utils.c',
+ 'load_manifest.c',
'mbprint.c',
'option_utils.c',
'parallel_slot.c',
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..105df90681
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,53 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AddIncrementalWalRange(IncrementalBackupInfo *ib, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif /* BASEBACKUP_INCREMENTAL_H */
diff --git a/src/include/fe_utils/load_manifest.h b/src/include/fe_utils/load_manifest.h
new file mode 100644
index 0000000000..311fd64db4
--- /dev/null
+++ b/src/include/fe_utils/load_manifest.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/fe_utils/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+#define LBM_FILES 0x0001
+#define LBM_WAL_RANGES 0x0002
+#define LBM_MISSING_OK 0x0004
+
+extern manifest_data *load_backup_manifest(char *pathname, int flags);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..fb0c72717f 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,16 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * INCREMENTAL_WAL_RANGE command
+ * ----------------------
+ */
+typedef struct IncrementalWalRangeCmd
+{
+ NodeTag type;
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} IncrementalWalRangeCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..46cb2a6550 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4d99b4b3f1..e7d8cf5195 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4018,3 +4018,14 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+IncrementalWalRangeCmd
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.39.3 (Apple Git-145)
Attachment: v12-0007-Test-patch-Enable-summarize_wal-by-default.patch
From cee201dcc7aa51e47ce734af95264a48b9f79d88 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v12 7/7] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal enabled when wal_level=minimal, because a bunch of
tests run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1d52a2db7c..37c711230b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -935,9 +935,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7966755f22..74a0116a13 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 405c422db7..d18f93d3c8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.39.3 (Apple Git-145)
On Thu, Nov 30, 2023 at 9:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> 0005: Incremental backup itself. Changes:
> - Remove UPLOAD_MANIFEST replication command and instead add
> INCREMENTAL_WAL_RANGE replication command.
Unfortunately, I think this change is going to need to be reverted.
Jakub reported a problem to me off-list, which I think boils down
to this: take a full backup on the primary. create a database on the
primary. now take an incremental backup on the standby using the full
backup from the master as the prior backup. What happens at this point
depends on how far replay has progressed on the standby. I think there
are three scenarios: (1) If replay has not yet reached a checkpoint
later than the one at which the full backup began, then taking the
incremental backup will fail. This is correct, because it makes no
sense to take an incremental backup that goes backwards in time, and
it's pointless to take one that goes forwards but not far enough to
reach the next checkpoint, as you won't save anything. (2) If replay
has progressed far enough that the redo pointer is now beyond the
CREATE DATABASE record, then everything is fine. (3) But if the redo
pointer for the backup is a later checkpoint than the one from which
the full backup started, but also before the CREATE DATABASE record,
then the new database's files exist on disk, but are not mentioned in
the WAL summary, which covers all LSNs from the start of the prior
backup to the start of this one. Here, the start of the backup is
basically the LSN from which replay will start, and since the database
was created after that, those changes aren't in the WAL summary. This
means that we think the file is unchanged since the prior backup, and
so back up no blocks at all. But now we have an incremental file for a
relation for which no full file is present in the prior backup, and
we're in big trouble.
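To make the failing sequence concrete, here's a rough TAP-style sketch
of Jakub's scenario, using the same Cluster.pm helpers as the attached
tests. This is an illustration only, not one of the attached tests; the
node and backup names are mine, and whether you land in scenario (2) or
(3) depends on how far the standby's replay has progressed:

my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init(allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'summarize_wal = on');
$primary->start;
$primary->backup('seed');
my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup($primary, 'seed', has_streaming => 1);
$standby->start;
# Step 1: full backup on the primary.
my $full = $primary->backup_dir . '/full';
$primary->command_ok(
	[ 'pg_basebackup', '-D', $full, '--no-sync', '-cfast' ],
	'full backup on primary');
# Step 2: create a database on the primary.
$primary->safe_psql('postgres', 'CREATE DATABASE created_later');
# Step 3: incremental backup on the standby, based on the primary's full
# backup. In scenario (3), the standby's redo pointer is past the full
# backup's checkpoint but before the CREATE DATABASE record, so the new
# database's files are on disk yet absent from the WAL summary, and they
# get sent incrementally with no prior full copy to apply them to.
my $incr = $standby->backup_dir . '/incr';
$standby->command_ok(
	[ 'pg_basebackup', '-D', $incr, '--no-sync', '-cfast',
	  '--incremental', $full . '/backup_manifest' ],
	'incremental backup on standby');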
If my analysis is correct, this bug should be new in v12. In v11 and
prior, I think that we always included every file that didn't appear
in the prior manifest in full. I didn't really know why I was
doing that, which is why I was willing to rip it out and thus remove
the need for the manifest, but now I think it was actually preventing
exactly this problem. The issue, in general, is files that get
created after the start of the backup. By that time, the WAL summary
that drives the backup has already been built, so it doesn't know
anything about the new files. That would be fine if we either (A)
omitted those new files from the backup completely, since replay would
recreate them or (B) backed them up in full, so that there was nothing
relying on them being there in the earlier backup. But an incremental
backup of such a file is no good.
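Put differently, the manifest buys us a one-line guard. Here's a hedged
Perl sketch of the v11-era rule being restored (not the server code;
$prior_files stands in for the hash table that load_backup_manifest
builds from the prior backup's manifest):

# A file the prior backup's manifest has never heard of must be sent in
# full, no matter what the WAL summary does or doesn't say about it.
sub file_backup_method
{
	my ($prior_files, $path) = @_;
	return 'full' unless exists $prior_files->{$path};
	# Only for files the prior backup actually contains is it safe to
	# consult the WAL summary and send just the modified blocks.
	return 'incremental';
}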
Then I started worrying about whether there were problems in cases
where a file was dropped and recreated with the same name. I *think*
it's OK. If file F is dropped and recreated after being copied into
the full backup but before being copied into the incremental backup,
then there are basically two cases. First, F might be dropped before
the start LSN of the incremental backup; if so, we'll know from the
WAL summary that the limit block is 0 and back up the whole thing.
Second, F might be dropped after the start LSN of the incremental
backup and before it's actually copied. In that case, we won't know
when backing up the file that it was dropped and recreated, so we'll
back it up incrementally as if that hadn't happened. That's OK as long
as reconstruction doesn't fail, because WAL replay will again drop and
recreate F. And I think reconstruction won't fail: blocks that are in
the incremental file will be taken from there, blocks in the prior
backup file will be taken from there, and blocks in neither place will
be zero-filled. The result is logically incoherent, but replay will
nuke the file anyway, so whatever.
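Spelled out, the per-block choice during reconstruction is a simple
three-way rule. A sketch in Perl pseudocode (the real logic is C inside
pg_combinebackup and operates on 8kB blocks, but the decision is the
same):

# $incr_blocks holds the block numbers present in the incremental file;
# $prior_nblocks is the length, in blocks, of the prior backup's copy.
sub reconstruction_source
{
	my ($blkno, $incr_blocks, $prior_nblocks) = @_;
	return 'incremental file' if exists $incr_blocks->{$blkno};
	return 'prior backup file' if $blkno < $prior_nblocks;
	return 'zero-filled';    # replay is expected to fix up the rest
}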
It bugs me a bit that we don't obey the WAL-before-data rule with file
creation, e.g. RelationCreateStorage does smgrcreate() and then
log_smgrcreate(). So in theory we could see a file on disk for which
nothing has been logged yet; it could even happen that the file gets
created before the start LSN of the backup and the log record gets
written afterward. It seems like it would be far more comfortable to
swap the order there, so that if it's on disk, it's definitely in the
WAL. But I haven't yet been able to think of a scenario in which the
current ordering causes a real problem. If we backup a stray file in
full (or, hypothetically, if we skipped it entirely) then nothing will
happen that can't already happen today with full backup; any problems
we end up having are, I think, not new problems. It's only when we
back up a file incrementally that we need to be careful, and the
analysis is basically the same as before ... whatever we put into an
incremental file will cause *something* to get reconstructed except
when there's no prior file at all. Having the manifest for the prior
backup lets us avoid the incremental-with-no-prior-file scenario. And
as long as *something* gets reconstructed, I think WAL replay will fix
up the rest.
Considering all this, what I'm inclined to do is go and put
UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
accordingly. But first: does anybody see more problems here that I may
have missed?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Considering all this, what I'm inclined to do is go and put
> UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> accordingly. But first: does anybody see more problems here that I may
> have missed?
OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
long comment explaining why that's believed to be necessary and
sufficient. I committed 0001 and 0002 from the previous series also,
since it doesn't seem like anyone has further comments on those
renamings.
This version also improves (at least, IMHO) the way that we wait for
WAL summarization to finish. Previously, you either caught up fully
within 60 seconds or you died. I didn't like that, because it seemed
like some people would get timeouts when the operation was slowly
progressing and would eventually succeed. So what this version does
is:
- Every 10 seconds, it logs a warning saying that it's still waiting
for WAL summarization. That way, a human operator can understand
what's happening easily, and cancel if they want.
- If 60 seconds go by without the WAL summarizer ingesting even a
single WAL record, it times out. That way, if the WAL summarizer is
dead or totally stuck (e.g. debugger attached, hung I/O) the user
won't be left waiting forever even if they never cancel. But if it's
just slow, it probably won't time out, and the operation should
eventually succeed.
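In pseudocode, the waiting policy looks something like this. It's only a
sketch of the control flow, not the backend C code; summarized_lsn() is
an invented stand-in for asking the summarizer how far it has gotten:

sub wait_for_wal_summarization
{
	my ($target_lsn) = @_;
	my $progress_lsn  = summarized_lsn();
	my $progress_time = time();
	my $warned_time   = time();
	while (summarized_lsn() < $target_lsn)
	{
		if (summarized_lsn() > $progress_lsn)
		{
			# Ingesting even one WAL record resets the 60-second clock.
			$progress_lsn  = summarized_lsn();
			$progress_time = time();
		}
		elsif (time() - $progress_time >= 60)
		{
			die "WAL summarization is not progressing";
		}
		if (time() - $warned_time >= 10)
		{
			warn "still waiting for WAL summarization to finish";
			$warned_time = time();
		}
		sleep(1);
	}
}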
To me, this seems like a reasonable compromise. It might be
unreasonable if WAL summarization is proceeding at a very low but
non-zero rate. But it's hard for me to think of a situation where that
will happen, with the exception of when CPU or I/O are badly
overloaded. But in those cases, the WAL generation rate is probably
also not that high, because apparently the system is paralyzed, so
maybe the wait won't even be that bad, especially given that
everything else on the box should be super-slow too. Plus, even if we
did want to time out in such a case, it's hard to know how slow is too
slow. In any event, I think most failures here are likely to be
complete failures, where the WAL summarizer just doesn't run, so the fact
that this times out in those cases seems to me likely to be as much as
we need to do here. But if someone sees a problem with this or has a
clever idea how to make it better, I'm all ears.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v13-0001-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
From d7ca8e0a1869688e48b8dd4844c819e80ef48f4c Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v13 1/5] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.39.3 (Apple Git-145)
Attachment: v13-0004-Add-new-pg_walsummary-tool.patch
From f44fd4c4f6479cf936ec3e667fd8ef0f3041b195 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v13 4/5] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 477 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, database OID, relation
+ OID, and relation fork. For each relation fork, it stores the list of
+ blocks that were modified by WAL within the range summarized in the file.
+ It can also store a "limit block," which is 0 if the relation fork was
+ created or destroyed within the relevant WAL range, and otherwise the
+ shortest length to which the relation fork was truncated. If the relation
+ fork was not created, destroyed, or truncated within the relevant WAL
+ range, the limit block is undefined or infinite and will not be printed
+ by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 02ed8ee6f5..524a671ca4 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -338,7 +338,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 48c2f6c56f..7e6d11f4a8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4030,3 +4030,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
Attachment: v13-0005-Test-patch-Enable-summarize_wal-by-default.patch (application/octet-stream)
From e94695da40330b3f31bc8fc956bc02873d8fa69a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v13 5/5] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal=on while wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index cbdbb6cdae..7c7ddca33e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -940,9 +940,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 524a671ca4..8d903283d8 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 405c422db7..d18f93d3c8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.39.3 (Apple Git-145)
Attachment: v13-0002-Add-a-new-WAL-summarizer-process.patch (application/octet-stream)
From 0bbade2baaf9a232c0044311a25117e56762f085 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v13 2/5] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1386 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 32 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3730 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 94d1eb2b81..4fc5c64150 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4150,6 +4150,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6526bd4f43..72c7c86707 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3587,6 +3588,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3867,8 +3905,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3911,6 +3949,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5235,9 +5293,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6954,6 +7012,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7628,6 +7705,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index bae6f68c40..5f244216a6 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -158,6 +162,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ae31d66930..cbdbb6cdae 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -567,6 +570,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -936,6 +940,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1838,6 +1845,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2662,6 +2672,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3015,6 +3027,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3133,6 +3146,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3528,6 +3555,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3678,6 +3711,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3704,6 +3739,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3801,6 +3837,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4022,6 +4059,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5329,6 +5368,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5465,6 +5508,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..02ed8ee6f5
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1386 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the last LSN and TLI at
+ * which the next summary file will start. Normally, these are the LSN and
+ * TLI at which the last file ended; in such case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200ms = 30s). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
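+
+/*
+ * For example, on a fully idle system, successive sleeps last 200ms, 400ms,
+ * 800ms, and so on as sleep_quanta doubles, until the 150-quantum cap
+ * (thirty seconds) is reached; see summarizer_wait_for_wal().
+ */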
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
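+/* default is 10 days, expressed in minutes (the GUC uses GUC_UNIT_MIN) */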
+int wal_summary_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else gets there first and initializes it before we
+ * reacquire the lock, then we can just return the requested information
+ * after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %u", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ *
+ * Either way, *pending_lsn is set to the value taken from WalSummarizerCtl.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ *pending_lsn = WalSummarizerCtl->pending_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ /* Clean up after any ConditionVariableTimedSleep we may have started. */
+ ConditionVariableCancelSleep();
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
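+
+ /*
+ * For example, a summary on TLI 1 covering 0/01000028 through 0/01000150
+ * is named 0000000100000000010000280000000001000150.summary.
+ */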
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %u from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private_data->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files older than wal_summary_keep_time, provided the
+ * WAL they summarize has itself already been removed. We do this at most
+ * once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the WAL this summary describes no longer exists, we can remove the
+ * summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 819936ec02..5c9b6f991e 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -305,6 +305,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 6474e35ec0..405c422db7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3201,6 +3214,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf9f283cfe..b2809c711a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
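+
+/*
+ * As an example of the scheme above, block 150000 of a given fork falls in
+ * chunk 2 (150000 / BLOCKS_PER_CHUNK) and is represented there as offset
+ * 18928 (150000 % BLOCKS_PER_CHUNK), either as one uint16 array element or
+ * as one bit in the chunk's bitmap.
+ */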
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status; /* status field required by simplehash.h */
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
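+
+/*
+ * Instantiating simplehash.h with the settings above generates the
+ * blockreftable_hash type along with functions such as blockreftable_create,
+ * blockreftable_insert, and blockreftable_lookup that are used below.
+ */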
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ {
+ unsigned stop_mod = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * If stop_blkno is an exact multiple of BLOCKS_PER_CHUNK, the
+ * whole final chunk is within range, so leave stop_offset at
+ * BLOCKS_PER_CHUNK rather than clamping it to 0.
+ */
+ if (stop_mod != 0)
+ stop_offset = stop_mod;
+ }
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
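+ *
+ * On-disk layout, as implemented below: a 4-byte magic number; then, for
+ * each entry (sorted by tablespace, then database, then relfilenumber,
+ * then fork number), a BlockRefTableSerializedEntry followed by its
+ * chunk usage array and the used portion of each chunk; and finally an
+ * all-zeroes sentinel entry followed by a CRC-32C of everything that
+ * precedes it.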
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
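+ *
+ * As implemented below, each chunk begins life as a small array of 2-byte
+ * block offsets that doubles in size as needed; once a chunk would reach
+ * MAX_ENTRIES_PER_CHUNK entries, it is converted to a bitmap with one bit
+ * per block, after which marking a block just sets the corresponding bit.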
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..79c8f86d89 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12100,4 +12100,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
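+ *
+ * As an illustration: if a relation fork is truncated from 1000 blocks to
+ * 100, the limit block number becomes 100, any recorded modifications to
+ * blocks 100 and above are forgotten, and every block from 100 upward
+ * must be treated as modified by consumers of the table.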
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
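+
+/*
+ * As an illustration only (not part of this patch), a conforming frontend
+ * write callback might look like this, treating short writes as fatal so
+ * that it always returns the full request length:
+ *
+ *     static int
+ *     my_write_cb(void *callback_arg, void *data, int length)
+ *     {
+ *         FILE *f = (FILE *) callback_arg;
+ *
+ *         if (fwrite(data, 1, length, f) != (size_t) length)
+ *             pg_fatal("could not write block reference table: %m");
+ *         return length;
+ *     }
+ */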
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
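+
+/*
+ * Illustrative reading loop (not part of this patch; my_read_cb,
+ * my_error_cb, and process_modified_blocks are placeholders):
+ *
+ *     BlockRefTableReader *reader;
+ *     RelFileLocator rlocator;
+ *     ForkNumber forknum;
+ *     BlockNumber limit_block;
+ *     BlockNumber blocks[256];
+ *     unsigned nblocks;
+ *
+ *     reader = CreateBlockRefTableReader(my_read_cb, my_read_arg, filename,
+ *                                        my_error_cb, my_error_arg);
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *                                            &limit_block))
+ *     {
+ *         while ((nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ *                                                        lengthof(blocks))) > 0)
+ *             process_modified_blocks(rlocator, forknum, blocks, nblocks);
+ *     }
+ *     DestroyBlockRefTableReader(reader);
+ */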
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
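+
+/*
+ * Illustrative writing sequence (not part of this patch; my_write_cb is a
+ * placeholder, and entries must be supplied in the sort order described
+ * above):
+ *
+ *     BlockRefTableWriter *writer;
+ *     BlockRefTableEntry *entry;
+ *
+ *     writer = CreateBlockRefTableWriter(my_write_cb, my_write_arg);
+ *     entry = CreateBlockRefTableEntry(rlocator, forknum);
+ *     BlockRefTableEntryMarkBlockModified(entry, forknum, blknum);
+ *     BlockRefTableWriteEntry(writer, entry);
+ *     BlockRefTableFreeEntry(entry);
+ *     DestroyBlockRefTableWriter(writer);
+ */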
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..ebc95bd326
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
+ XLogRecPtr *pending_lsn);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4b25961249..e87fd25d64 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 38a86575e1..4d99b4b3f1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4007,3 +4007,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.39.3 (Apple Git-145)
v13-0003-Add-support-for-incremental-backup.patch
From 4373fc739fe73ce43487328a6b4d49e031f0a4c2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v13 3/5] Add support for incremental backup.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar, Andres Freund, and Álvaro Herrera
for design discussion and reviews, and to Jakub Wartak for incredibly
helpful and extensive testing.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/protocol.sgml | 24 +
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 319 +++-
src/backend/backup/basebackup_incremental.c | 1003 +++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 112 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1284 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
49 files changed, 5822 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest of an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
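+
+  <para>
+   For example, assuming a full backup already exists in
+   <filename>/backups/full</filename> (the directory names here are purely
+   illustrative), an incremental backup based on it could be taken with:
+<programlisting>
+pg_basebackup -D /backups/incr --incremental=/backups/full/backup_manifest
+</programlisting>
+  </para>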
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or
+ changes only slowly. For a small database, it's easier to ignore the
+ existence of incremental backups and just take full backups, which are
+ simpler to manage. For a large database that is heavily modified
+ throughout, incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4fc5c64150..13212ba5d9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4153,13 +4153,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index af3f016f74..9a66918171 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2599,6 +2599,19 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</listitem>
</varlistentry>
+ <varlistentry id="protocol-replication-upload-manifest">
+ <term>
+ <literal>UPLOAD_MANIFEST</literal>
+ <indexterm><primary>UPLOAD_MANIFEST</primary></indexterm>
+ </term>
+ <listitem>
+ <para>
+ Uploads a backup manifest in preparation for taking an incremental
+ backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP">
<term><literal>BASE_BACKUP</literal> [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
@@ -2838,6 +2851,17 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>INCREMENTAL</literal></term>
+ <listitem>
+ <para>
+ Requests an incremental backup. The
+ <literal>UPLOAD_MANIFEST</literal> command must be executed
+ before running a base backup with this option.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
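+
+ <para>
+  For example (directory names here are purely illustrative), a full
+  backup in <filename>full</filename> followed by two incremental backups
+  in <filename>incr1</filename> and <filename>incr2</filename> could be
+  combined into <filename>outdir</filename> with:
+<programlisting>
+pg_combinebackup full incr1 incr2 -o outdir
+</programlisting>
+ </para>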
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the output directory. The search for files will follow
+ symbolic links for the WAL directory and each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ output directory. In that case, <command>pg_combinebackup</command>
+ will also synchronize the file systems that contain the WAL files and
+ each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
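
For illustration, a backup_label written for an incremental backup would then
contain two extra lines along these lines (the LSN and TLI values here are
hypothetical):

INCREMENTAL FROM LSN: 0/D000028
INCREMENTAL FROM TLI: 1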
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..5ee9628422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -33,6 +35,7 @@
#include "pgtar.h"
#include "port.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
#include "storage/bufpage.h"
@@ -64,6 +67,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +80,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +113,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +232,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +283,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +306,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +357,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +365,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +627,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +703,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +782,20 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ if (opt->incremental && !summarize_wal)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +988,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1012,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
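+ /*
+ * (Illustrative flow, stated here rather than enforced: the client issues
+ * UPLOAD_MANIFEST and transfers the manifest from the prior backup, then
+ * issues BASE_BACKUP with the incremental option; pg_basebackup's
+ * --incremental mode is expected to perform both steps automatically.)
+ */
+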
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1057,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1135,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1169,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1189,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1198,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1240,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1392,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1467,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1542,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1558,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1571,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1581,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1614,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1901,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
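+
+/*
+ * Usage sketch (illustrative only; it mirrors the header-emission code in
+ * sendFile above): push a few fields, then perform the final flush that is
+ * left to the caller, feeding any flushed bytes to the checksum:
+ *
+ *		size_t	done = 0;
+ *
+ *		push_to_sink(sink, &checksum_ctx, &done, &magic, sizeof(magic));
+ *		push_to_sink(sink, &checksum_ctx, &done, blocks,
+ *					 nblocks * sizeof(BlockNumber));
+ *		if (done > 0)
+ *		{
+ *			bbsink_archive_contents(sink, done);
+ *			if (pg_checksum_update(&checksum_ctx, (uint8 *) sink->bbs_buffer,
+ *								   done) < 0)
+ *				elog(ERROR, "could not update checksum");
+ *		}
+ */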
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..1e5a5ac33a
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,1003 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
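+
+/*
+ * Rough arithmetic on the "1 bit per block" claim above (illustrative): at
+ * 8kB blocks, a 1TB database has about 134 million blocks, so even with
+ * nearly every block modified the in-memory table should converge to
+ * something in the neighborhood of 16MB.
+ */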
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+ XLogRecPtr pending_lsn;
+ XLogRecPtr prior_pending_lsn = InvalidXLogRecPtr;
+ int deadcycles = 0;
+ TimestampTz initial_time,
+ current_time;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ initial_time = current_time = GetCurrentTimestamp();
+ while (1)
+ {
+ long timeout_in_ms = 10000;
+ unsigned elapsed_seconds;
+
+ /*
+ * Align the wait time to prevent drift. This doesn't really matter,
+ * but we'd like the warnings about how long we've been waiting to say
+ * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
+ * drifting to something that is not a multiple of ten.
+ */
+ timeout_in_ms -=
+ TimestampDifferenceMilliseconds(initial_time, current_time) %
+ timeout_in_ms;
+
+ /* Wait for up to the remaining time, at most 10 seconds. */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
+ timeout_in_ms, &pending_lsn);
+
+ /* If WAL summarization has progressed sufficiently, stop waiting. */
+ if (summarized_lsn >= backup_state->startpoint)
+ break;
+
+ /*
+ * Keep track of the number of cycles during which there has been no
+ * progression of pending_lsn. If pending_lsn is not advancing, that
+ * means that not only are no new files appearing on disk, but we're
+ * not even incorporating new records into the in-memory state.
+ */
+ if (pending_lsn > prior_pending_lsn)
+ {
+ prior_pending_lsn = pending_lsn;
+ deadcycles = 0;
+ }
+ else
+ ++deadcycles;
+
+ /*
+ * If we've managed to wait for an entire minute without the WAL
+ * summarizer absorbing a single WAL record, error out; probably
+ * something is wrong.
+ *
+ * We could consider also erroring out if the summarizer is taking too
+ * long to catch up, but it's not clear what rate of progress would be
+ * acceptable and what would be too slow. So instead, we just try to
+ * error out in the case where there's no progress at all. That seems
+ * likely to catch a reasonable number of the things that can go wrong
+ * in practice (e.g. the summarizer process is completely hung, say
+ * because somebody hooked up a debugger to it or something) without
+ * giving up too quickly when the system is just slow.
+ */
+ if (deadcycles >= 6)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summarization is not progressing"),
+ errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+
+ /*
+ * Otherwise, just let the user know what's happening.
+ */
+ current_time = GetCurrentTimestamp();
+ elapsed_seconds =
+ TimestampDifferenceMilliseconds(initial_time, current_time) / 1000;
+ ereport(WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("still waiting for WAL summarization through %X/%X after %d seconds",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ elapsed_seconds),
+ errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+ }
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
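+
+/*
+ * For example (hypothetical OIDs, default tablespace): dboid 16384,
+ * relfilenumber 16385, main fork, segno 0 yields
+ * "base/16384/INCREMENTAL.16385", while segno 1 yields
+ * "base/16384/INCREMENTAL.16385.1".
+ */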
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * If this file was not part of the prior backup, back it up fully.
+ *
+ * If this file was created after the prior backup and before the start of
+ * the current backup, then the WAL summary information will tell us to
+ * back up the whole file. However, if this file was created after the
+ * start of the current backup, then the WAL summary won't know anything
+ * about it. Without this logic, we would erroneously conclude that it was
+ * OK to send it incrementally.
+ *
+ * Note that the file could have existed at the time of the prior backup,
+ * gotten deleted, and then a new file with the same name could have been
+ * created. In that case, this logic won't prevent the file from being
+ * backed up incrementally. But, if the deletion happened before the start
+ * of the current backup, the limit block will be 0, inducing a full
+ * backup. If the deletion happened after the start of the current backup,
+ * reconstruction will erroneously combine blocks from the current
+ * lifespan of the file with blocks from the previous lifespan -- but in
+ * this type of case, WAL replay to reach backup consistency should remove
+ * and recreate the file anyway, so the initial bogus contents should not
+ * matter.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose them into relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
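+ *
+ * For example, with the default 8kB block size, an incremental file
+ * containing two blocks occupies 3 * 4 + 2 * (4 + 8192) = 16404 bytes.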
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
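+ *
+ * We can't just return the difference of the two values, because BlockNumber
+ * is unsigned and the difference might not fit in the int return type.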
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..dbcda32554 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
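+ *
+ * The client side of this exchange can be seen in pg_basebackup's
+ * BaseBackup(): it sends UPLOAD_MANIFEST, waits for the CopyInResponse we
+ * generate below, streams the manifest as a series of CopyData messages,
+ * and finishes with CopyDone.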
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
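+ /* (format byte 0 = text; zero columns, so no per-column format codes) */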
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in COPY IN mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2225a4a6e6..3828d1dc16 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -138,6 +139,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -334,6 +336,7 @@ CreateOrAttachShmemStructs(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..5795b91261 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,76 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* Reject if server is too old. */
+ if (serverVersion < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ pg_fatal("server does not support incremental backup");
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1997,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2353,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2391,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2416,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2451,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2867,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by a LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
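+ *
+ * For illustration, the lines of interest in a typical backup_label for an
+ * incremental backup look something like this:
+ *
+ *     START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
+ *     START TIMELINE: 1
+ *     INCREMENTAL FROM LSN: 0/1000028
+ *     INCREMENTAL FROM TLI: 1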
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..40a55e3087
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..ad32323c9c
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..63dcbf329d
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1284 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/*
+ * Incremental file naming convention: the incremental version of a relation
+ * file has the same name with this prefix attached, e.g. an incremental
+ * copy of base/1/16384 is shipped as base/1/INCREMENTAL.16384.
+ */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line. Note that the array set up
+ * here also has the final backup as its last element; only the first
+ * n_prior_backups entries are prior backups, but keeping the final backup
+ * at the end of the same array is convenient when loading the manifests
+ * for all of the backups just below.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, prior_backup_dirs);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ {
+ mwriter = create_manifest_writer(opt.output);
+
+ /*
+ * Verify that we have a backup manifest for the final backup; else we
+ * won't have the WAL ranges for the resulting manifest.
+ */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ }
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
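+ *
+ * For example, the argument "/old\=dir=/new" yields old_dir "/old=dir"
+ * and new_dir "/new".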
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH - 1)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long ul;
+ char *ep;
+
+ errno = 0;
+ ul = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || ul < 1 || ul > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) ul;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
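+ *
+ * For example, a file "16385" in "base/5" of the main data directory
+ * has manifest path "base/5/16385", while a file in a user-defined
+ * tablespace gets a "pg_tblspc/<OID>/" prefix instead.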
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ closedir(dir);
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
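+ *
+ * For example, if this file has truncation_block_length 4 and contains
+ * blocks 2 and 5, then blocks 2 and 5 come from this file, blocks 0,
+ * 1, and 3 must still be found elsewhere, block 4 will be zero-filled,
+ * and the reconstructed file will be 6 blocks long.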
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without
+ * taking any action on those new blocks that would generate WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
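+ *
+ * The header layout consumed here is: a magic number, the block count,
+ * the truncation block length, and then one relative block number per
+ * block; the actual block data follows the header.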
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ int rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
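+ /* Ranges render as " 0-3:<filename>@<offset>" or " 4-5:zero". */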
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source file. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+	'-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+	'table contains expected rows');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that the no-checksum manifest does not mention a checksum algorithm.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,291 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexademical digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif							/* BASEBACKUP_INCREMENTAL_H */
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..46cb2a6550 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4d99b4b3f1..48c2f6c56f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4018,3 +4018,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.39.3 (Apple Git-145)
On Tue, Dec 5, 2023 at 7:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
[..v13 patchset]
The results with the v13 patchset are as follows:
* - requires checkpoint on primary when doing incremental on standby
when it's too idle; this was explained by Robert in [1], something AKA
too-fast-incremental backup due to the testing scenario:
test_across_wallevelminimal.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_standby_incr_just_backup.sh - GOOD*
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD*
test_full_stby__incr_stby__restore_on_stby.sh - GOOD*
test_incr_on_standby_after_promote.sh - GOOD*
test_incr_after_timelineincrease.sh (pg_ctl stop, pg_resetwal -l
00000002000000000000000E ..., pg_ctl start, pg_basebackup
--incremental) - GOOD, I've got:
pg_basebackup: error: could not initiate base backup: ERROR:
timeline 1 found in manifest, but not in this server's history
Comment: I was wondering if it wouldn't make some sense to teach
pg_resetwal to actually delete all WAL summaries after any
WAL/controlfile alteration?
test_stuck_walsummary.sh (pkill -STOP walsumm) - GOOD:
This version also improves (at least, IMHO) the way that we wait for
WAL summarization to finish. Previously, you either caught up fully
within 60 seconds or you died. I didn't like that, because it seemed
like some people would get timeouts when the operation was slowly
progressing and would eventually succeed. So what this version does
is:
WARNING: still waiting for WAL summarization through 0/A0000D8
after 10 seconds
DETAIL: Summarization has reached 0/8000028 on disk and 0/80000F8
in memory.
[..]
pg_basebackup: error: could not initiate base backup: ERROR: WAL
summarization is not progressing
DETAIL: Summarization is needed through 0/A0000D8, but is stuck
at 0/8000028 on disk and 0/80000F8 in memory.
Comment2: looks good to me!
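(As a sketch for readers following the thread, the retry behavior
described above amounts to roughly the loop below, given a target LSN
needed_lsn; GetSummarizerProgress() and the exact interval are
illustrative assumptions here, not the actual walsummarizer API.)

    XLogRecPtr  summarized = GetSummarizerProgress();   /* hypothetical */

    while (summarized < needed_lsn)
    {
        XLogRecPtr  previous = summarized;

        /* Sleep ten seconds between progress checks. */
        pg_usleep(10L * 1000000L);
        summarized = GetSummarizerProgress();

        /* Give up only if no progress at all was made while we slept. */
        if (summarized <= previous)
            ereport(ERROR,
                    (errmsg("WAL summarization is not progressing")));

        /* Otherwise warn and keep waiting, with no overall timeout. */
        ereport(WARNING,
                (errmsg("still waiting for WAL summarization through %X/%X",
                        LSN_FORMAT_ARGS(needed_lsn))));
    }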
test_pending_2pc.sh - getting GOOD on most recent runs, but several
times during early testing (probably due to my own mishaps), I've been
hit by Abort/TRAP. I'm still investigating and trying to reproduce
those ones. TRAP: failed Assert("summary_end_lsn >=
WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940
Regards,
-J.
[1]: /messages/by-id/CA+TgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA@mail.gmail.com
On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Comment: I was wondering if it wouldn't make some sense to teach
pg_resetwal to actually delete all WAL summaries after any
WAL/controlfile alteration?
I thought that this was a good idea so I decided to go implement it,
only to discover that it was already part of the patch set ... did you
find some case where it doesn't work as expected? The code looks like
this:
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
KillExistingWALSummaries();
WriteEmptyXLOG();
test_pending_2pc.sh - getting GOOD on most recent runs, but several
times during early testing (probably due to my own mishaps), I've been
hit by Abort/TRAP. I'm still investigating and trying to reproduce
those ones. TRAP: failed Assert("summary_end_lsn >=
WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940
I have a fix for this locally, but I'm going to hold off on publishing
a new version until either there's a few more things I can address all
at once, or until Thomas commits the ubsan fix.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Dec 7, 2023 at 4:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
Hi Robert,
On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Comment: I was wondering if it wouldn't make some sense to teach
pg_resetwal to actually delete all WAL summaries after any
WAL/controlfile alteration?
I thought that this was a good idea so I decided to go implement it,
only to discover that it was already part of the patch set ... did you
find some case where it doesn't work as expected? The code looks like
this:
Ah, my bad, with a fresh mind and coffee the error message makes it
clear and of course it did reset the summaries properly.
While we are at it, maybe around the below in PrepareForIncrementalBackup()
    if (tlep[i] == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("timeline %u found in manifest, but not in this server's history",
                        range->tli)));
we could add
errhint("You might need to start a new full backup instead of
incremental one")
?
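i.e., putting it together with the existing check (just a sketch):

    if (tlep[i] == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("timeline %u found in manifest, but not in this server's history",
                        range->tli),
                 errhint("You might need to start a new full backup instead of an incremental one.")));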
test_pending_2pc.sh - getting GOOD on most recent runs, but several
times during early testing (probably due to my own mishaps), I've been
hit by Abort/TRAP. I'm still investigating and trying to reproduce
those ones. TRAP: failed Assert("summary_end_lsn >=
WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940I have a fix for this locally, but I'm going to hold off on publishing
a new version until either there's a few more things I can address all
at once, or until Thomas commits the ubsan fix.
Great, I cannot get it to fail again today; it had to be some dirty
state of the testing env. BTW: Thomas has pushed that ubsan fix.
-J.
On Tue, Dec 5, 2023 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
Considering all this, what I'm inclined to do is go and put
UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
accordingly. But first: does anybody see more problems here that I may
have missed?
OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
long comment explaining why that's believed to be necessary and
sufficient. I committed 0001 and 0002 from the previous series also,
since it doesn't seem like anyone has further comments on those
renamings.
I have done some testing on standby, but I am facing some issues,
although things are working fine on the primary. As shown below in
test [1], the standby is reporting errors that the manifest requires
WAL from 0/60000F8, but this backup starts at 0/6000028. Then I tried
to look into the manifest file of the full backup and it shows the
contents below [0]. Actually, from this WARNING and ERROR, I am not
clear what the problem is. I understand that the full backup ends at
"0/60000F8", so for the next incremental backup we should be looking
for a summary that has WAL starting at "0/60000F8", and we do have
those WALs. In fact, the error message says "this backup starts at
0/6000028", which is before "0/60000F8", so what's the issue?
[0]:
"WAL-Ranges": [
{ "Timeline": 1, "Start-LSN": "0/6000028", "End-LSN": "0/60000F8" }
[1]:
-- test on primary
dilipkumar@dkmac bin % ./pg_basebackup -D d
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest
-- cleanup the backup directory
dilipkumar@dkmac bin % rm -rf d
dilipkumar@dkmac bin % rm -rf d1
--test on standby
dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433
WARNING: aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR: manifest
requires WAL from final timeline 1 ending at 0/60000F8, but this
backup starts at 0/6000028
pg_basebackup: removing data directory "d1"
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 11, 2023 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Dec 5, 2023 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
Considering all this, what I'm inclined to do is go and put
UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
accordingly. But first: does anybody see more problems here that I may
have missed?
OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
long comment explaining why that's believed to be necessary and
sufficient. I committed 0001 and 0002 from the previous series also,
since it doesn't seem like anyone has further comments on those
renamings.
I have done some testing on standby, but I am facing some issues,
although things are working fine on the primary. As shown below in
test [1], the standby is reporting errors that the manifest requires
WAL from 0/60000F8, but this backup starts at 0/6000028. Then I tried
to look into the manifest file of the full backup and it shows the
contents below [0]. Actually, from this WARNING and ERROR, I am not
clear what the problem is. I understand that the full backup ends at
"0/60000F8", so for the next incremental backup we should be looking
for a summary that has WAL starting at "0/60000F8", and we do have
those WALs. In fact, the error message says "this backup starts at
0/6000028", which is before "0/60000F8", so what's the issue?
[0]
"WAL-Ranges": [
{ "Timeline": 1, "Start-LSN": "0/6000028", "End-LSN": "0/60000F8" }[1]
-- test on primary
dilipkumar@dkmac bin % ./pg_basebackup -D d
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest
-- cleanup the backup directory
dilipkumar@dkmac bin % rm -rf d
dilipkumar@dkmac bin % rm -rf d1
--test on standby
dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433
WARNING: aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR: manifest
requires WAL from final timeline 1 ending at 0/60000F8, but this
backup starts at 0/6000028
pg_basebackup: removing data directory "d1"
Jakub pinged me offlist and pointed me to the thread [1], where it is
already explained, so I think we can ignore this.
[1]: /messages/by-id/CA+TgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA@mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
While we are at it, maybe around the below in PrepareForIncrementalBackup()
    if (tlep[i] == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("timeline %u found in manifest, but not in this server's history",
                        range->tli)));
we could add
errhint("You might need to start a new full backup instead of
incremental one")
?
I can't exactly say that such a hint would be inaccurate, but I think
the impulse to add it here is misguided. One of my design goals for
this system is to make it so that you never have to take a new
incremental backup "just because," not even in case of an intervening
timeline switch. So, all of the errors in this function are warning
you that you've done something that you really should not have done.
In this particular case, you've either (1) manually removed the
timeline history file, and not just any timeline history file but the
one for a timeline for a backup that you still intend to use as the
basis for taking an incremental backup or (2) tried to use a full
backup taken from one server as the basis for an incremental backup on
a completely different server that happens to share the same system
identifier, e.g. because you promoted two standbys derived from the
same original primary and then tried to use a full backup taken on one
as the basis for an incremental backup taken on the other.
The scenario I was really concerned about when I wrote this test was
(2), because that could lead to a corrupt restore. This test isn't
strong enough to prevent that completely, because two unrelated
standbys can branch onto the same new timelines at the same LSNs, and
then these checks can't tell that something bad has happened. However,
they can detect a useful subset of problem cases. And the solution is
not so much "take a new full backup" as "keep straight which server is
which." Likewise, in case (1), the relevant hint would be "don't
manually remove timeline history files, and if you must, then at least
don't nuke timelines that you actually still care about."
I have a fix for this locally, but I'm going to hold off on publishing
a new version until either there's a few more things I can address all
at once, or until Thomas commits the ubsan fix.
Great, I cannot get it to fail again today; it had to be some dirty
state of the testing env. BTW: Thomas has pushed that ubsan fix.
Huzzah, the cfbot likes the patch set now. Here's a new version with
the promised fix for your non-reproducible issue. Let's see whether
you and cfbot still like this version.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v14-0001-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
From 02f16ee535bd4f2c501c644e33f4658de732f580 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v14 1/5] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.39.3 (Apple Git-145)
v14-0005-Test-patch-Enable-summarize_wal-by-default.patch
From 354a066bafe030596cc2fc9fcc290cb4bde18227 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v14 5/5] Test patch: Enable summarize_wal by default.
To avoid test failures, must remove the prohibition against running
summarize_wal=on with wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b163e89cbb..51dc517710 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -937,9 +937,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9fa155349e..71025b43b7 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9f59440526..f249a9fad5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1795,7 +1795,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.39.3 (Apple Git-145)
v14-0002-Add-a-new-WAL-summarizer-process.patch
From c4eed6c4120bb245f42b8110b99cd9f6accf6b31 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v14 2/5] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to copied in case of an incremental backup
covering that range of WAL records.
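To restate the limit-block rule as a sketch (the function and argument
names here are illustrative, not the actual blkreftable API):

    /* Limit block for one relation fork over the summarized WAL range. */
    BlockNumber
    limit_block_for_range(bool created_or_destroyed, bool truncated,
                          BlockNumber shortest_truncated_length)
    {
        if (created_or_destroyed)
            return 0;           /* nothing old survives; whole file is new */
        if (truncated)
            return shortest_truncated_length;   /* later blocks are gone */
        return InvalidBlockNumber;  /* no limit; consult modified blocks */
    }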
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1398 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 33 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3743 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 44cada2b40..ee98585027 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4150,6 +4150,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+      <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 01e0484584..421a016ca1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3589,6 +3590,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3869,8 +3907,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter two do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3913,6 +3951,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5237,9 +5295,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6956,6 +7014,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7630,6 +7707,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL, false);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
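Since both xlog.c hunks above lean on segment-number arithmetic, here is
a standalone sketch of that math, assuming the default 16MB segment
size; the real code uses the XLByteToSeg and XLogFromFileName macros
rather than these spelled-out constants:

#include <stdint.h>
#include <stdio.h>

#define SEG_SIZE    ((uint64_t) 16 * 1024 * 1024)       /* default segment size */
#define SEGS_PER_ID (UINT64_C(0x100000000) / SEG_SIZE)  /* 256 for 16MB */

int
main(void)
{
    uint64_t lsn = UINT64_C(0x03000058);    /* LSN 0/3000058 */
    uint64_t segno = lsn / SEG_SIZE;        /* XLByteToSeg: segno 3 */

    /* WAL file name: TLI, then segno split across two 8-digit fields. */
    printf("%08X%08X%08X\n",                /* 000000010000000000000003 */
           1U,                              /* TLI 1 */
           (unsigned) (segno / SEGS_PER_ID),
           (unsigned) (segno % SEGS_PER_ID));
    return 0;
}

KeepLogSeg() then just takes a minimum over segment numbers: the LSN
returned by GetOldestUnsummarizedLSN() is mapped to its containing
segment, and recycling is never allowed to advance past it.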
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
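Two notes on the code above. First, a concrete instance of the naming
scheme that IsWalSummaryFilename() accepts: a summary for TLI 1 covering
0/16000A8 through 0/170EFD8 is named

0000000100000000016000A8000000000170EFD8.summary

that is, five 8-hex-digit fields (TLI, start LSN high/low, end LSN
high/low), which is where the strspn() == 40 test comes from. Second,
the frontier-advancing loop in WalSummariesAreComplete() is easiest to
see in a standalone toy model, with summaries reduced to (start, end)
pairs already sorted by start:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t start_lsn, end_lsn; } Summary;

/*
 * Toy model of WalSummariesAreComplete(): advance the known-summarized
 * frontier from start toward end; any summary that begins beyond the
 * frontier proves a gap.
 */
static bool
covers(const Summary *s, int n, uint64_t start, uint64_t end,
       uint64_t *missing)
{
    uint64_t frontier = start;

    for (int i = 0; i < n; i++)
    {
        if (s[i].start_lsn > frontier)
            break;              /* gap before this summary */
        if (s[i].end_lsn > frontier)
        {
            frontier = s[i].end_lsn;
            if (frontier >= end)
                return true;
        }
    }
    *missing = frontier;
    return false;
}

int
main(void)
{
    /* [0x100,0x200) plus [0x200,0x500) cover [0x150,0x400)... */
    Summary s[] = {{0x100, 0x200}, {0x200, 0x500}};
    uint64_t missing = 0;

    printf("%d\n", covers(s, 2, 0x150, 0x400, &missing));   /* 1 */
    /* ...but not [0x150,0x600); the first missing LSN is 0x500. */
    printf("%d\n", covers(s, 2, 0x150, 0x600, &missing));   /* 0 */
    return 0;
}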
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
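For hand-testing the two functions above, something like this works from
psql. This is a sketch: the LSNs are made up, and I'm going by the
output columns as the code above builds them, (tli, start_lsn, end_lsn)
for the first function and a final boolean named is_limit_block for the
second, per the pg_proc.dat entries:

SELECT * FROM pg_available_wal_summaries();

SELECT count(*)
  FROM pg_wal_summary_contents(1, '0/16000A8', '0/170EFD8')
 WHERE NOT is_limit_block;

The first call lists one row per file in pg_wal/summaries; the second
counts the modified blocks recorded in a single summary, excluding the
synthetic limit-block rows.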
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index bae6f68c40..5f244216a6 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -158,6 +162,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 651b85ea74..b163e89cbb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -113,6 +113,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -250,6 +251,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -441,6 +443,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -564,6 +567,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -933,6 +937,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1835,6 +1842,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2659,6 +2669,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3012,6 +3024,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3130,6 +3143,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3525,6 +3552,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3675,6 +3708,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3701,6 +3736,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3798,6 +3834,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4019,6 +4056,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5326,6 +5365,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5462,6 +5505,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..7c840c36b3
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1398 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which the
+ * next summary file will start. Normally, these are the TLI and LSN at
+ * which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory, and ask
+ * GetOldestUnsummarizedLSN to reset pending_lsn to summarized_lsn. We
+ * might be recovering from an error, and if so, pending_lsn might have
+ * advanced past summarized_lsn, but any WAL we read previously has been
+ * lost and will need to be reread.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact, true);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ *
+ * If reset_pending_lsn is true, resets the pending_lsn in shared memory to
+ * be equal to the summarized_lsn.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
+ bool reset_pending_lsn)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Unless we need to reset the pending_lsn, we initially acquire the lock
+ * in shared mode and try to fetch the required information. If we acquire
+ * in shared mode and find that the data structure hasn't been
+ * initialized, we reacquire the lock in exclusive mode so that we can
+ * initialize it. However, if someone else does that first before we get
+ * the lock, then we can just return the requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ if (reset_pending_lsn)
+ WalSummarizerCtl->pending_lsn =
+ WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the information to the caller, as requested. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ *
+ * Either way, *pending_lsn is set to the value taken from WalSummarizerCtl.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ *pending_lsn = WalSummarizerCtl->pending_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
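+
+ /*
+ * The summary file name encodes the TLI and both LSNs as five 8-digit
+ * hex fields; for example (values hypothetical), a summary on TLI 1
+ * covering 0/1000028 through 0/100ED48 would be named
+ * 000000010000000001000028000000000100ED48.summary.
+ */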
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * At least one full block is available, so read exactly one block;
+ * the caller can come back if it needs more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private_data->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We reduce the sleep time by
+ * one quantum for each page read, but never below one quantum, which
+ * is a fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
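+
+ /*
+ * For example, if sleep_quanta is 8 and three pages were read, the
+ * next sleep uses 5 quanta; if 8 or more pages were read, it drops
+ * straight to the 1-quantum minimum.
+ */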
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove old WAL summary files, if appropriate. We do this at most once
+ * per checkpoint cycle, and not at all if WAL summary removal is disabled.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
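+
+ /*
+ * For example, with the default wal_summary_keep_time of 10 days (14400
+ * minutes), files last modified more than 864000 seconds ago qualify.
+ */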
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the summary
+ * file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 819936ec02..5c9b6f991e 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -305,6 +305,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f7c9882f7c..9f59440526 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -703,6 +704,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1786,6 +1789,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3200,6 +3213,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf9f283cfe..b2809c711a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
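+
+/*
+ * Worked example: with 2^16 blocks per chunk, block 150000 falls in chunk 2
+ * (blocks 131072..196607) at offset 18928. In array form each modified
+ * block costs one 2-byte entry; in bitmap form a chunk costs a fixed
+ * MAX_ENTRIES_PER_CHUNK * sizeof(uint16) = 8192 bytes (one bit per block).
+ * The two forms are the same size at exactly MAX_ENTRIES_PER_CHUNK modified
+ * blocks, which is the point at which we convert.
+ */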
+typedef uint16 *BlockRefTableChunk;
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
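+
+/*
+ * Taken together with the writer code below, the on-disk layout is: a
+ * 4-byte magic number; then, for each relation fork in sorted order, one
+ * BlockRefTableSerializedEntry followed by its chunk-usage array (nchunks
+ * uint16 values) and the contents of each nonempty chunk; then an
+ * all-zeroes sentinel entry; and finally a CRC-32C of everything that
+ * precedes it.
+ */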
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
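+
+/*
+ * As a usage sketch (BATCH and process are hypothetical), a caller can
+ * retrieve every modified block by advancing the window between calls:
+ *
+ * start = 0;
+ * while ((n = BlockRefTableEntryGetBlocks(entry, start,
+ * InvalidBlockNumber, blocks, BATCH)) > 0)
+ * {
+ * process(blocks, n);
+ * start = blocks[n - 1] + 1;
+ * }
+ */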
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
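+
+/*
+ * Typical use of the reader is a nested loop, roughly as sketched below
+ * (my_read_cb, my_error_cb, and process_block are hypothetical
+ * caller-supplied routines):
+ *
+ * reader = CreateBlockRefTableReader(my_read_cb, &io_state, filename,
+ * my_error_cb, NULL);
+ * while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ * &limit_block))
+ * while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
+ * lengthof(blocks))) > 0)
+ * process_block(&rlocator, forknum, blocks, n);
+ * DestroyBlockRefTableReader(reader);
+ */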
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
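+
+/*
+ * Sketch of that standalone workflow (write_cb is a hypothetical
+ * caller-supplied output callback; entries must be supplied in sorted
+ * order, as noted for BlockRefTableWriteEntry above):
+ *
+ * writer = CreateBlockRefTableWriter(write_cb, &io_state);
+ * entry = CreateBlockRefTableEntry(rlocator, forknum);
+ * BlockRefTableEntrySetLimitBlock(entry, limit_block);
+ * BlockRefTableEntryMarkBlockModified(entry, forknum, blkno);
+ * BlockRefTableWriteEntry(writer, entry);
+ * BlockRefTableFreeEntry(entry);
+ * DestroyBlockRefTableWriter(writer);
+ */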
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
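+
+ /*
+ * For example, a limit_block of 200000 gives limit_chunkno = 3 and
+ * limit_chunkoffset = 3392: chunks 4 and above are emptied outright
+ * below, and within chunk 3 only offsets less than 3392 survive,
+ * whether that chunk is stored as a bitmap or as an offset array.
+ */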
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
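
[Aside, not part of the patch: to make the io_callback hook used by
BlockRefTableFlush/BlockRefTableWrite above concrete, here is a minimal
sketch of a write callback that could be paired with this buffering layer.
The function name and the use of stdio are hypothetical; the real callers in
the patch set go through PostgreSQL's File machinery instead.]

/*
 * Illustrative write callback (sketch only; assumes <stdio.h>, <stdlib.h>,
 * <string.h>, and <errno.h>). Per the callback contract, a write callback
 * must not return short, so a short write is treated as fatal here.
 */
static int
stdio_write_callback(void *callback_arg, void *data, int length)
{
    FILE       *fp = (FILE *) callback_arg;

    if (fwrite(data, 1, length, fp) != (size_t) length)
    {
        fprintf(stderr, "could not write: %s\n", strerror(errno));
        exit(1);
    }
    return length;
}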
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
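
[Aside: WalSummaryIO above pairs a File with a position so the I/O callbacks
can carry their own state. A plausible shape for ReadWalSummary, sketched
purely from these declarations (the wait event name is a guess, and the real
implementation in the patch may differ):]

int
ReadWalSummary(void *wal_summary_io, void *data, int length)
{
    WalSummaryIO *io = (WalSummaryIO *) wal_summary_io;
    int         nbytes;

    /* Read from the current position, then advance it. */
    nbytes = FileRead(io->file, data, length, io->filepos,
                      WAIT_EVENT_WAL_SUMMARY_READ); /* hypothetical name */
    if (nbytes < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read file \"%s\": %m",
                        FilePathName(io->file))));
    io->filepos += nbytes;
    return nbytes;
}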
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 77e8b13764..916c8ec8d0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12099,4 +12099,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
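
[Aside: the two functions above make the summarizer's progress visible from
SQL. A hypothetical session, with illustrative output only:]

-- Which LSN ranges have been summarized so far?
SELECT * FROM pg_available_wal_summaries();
 tli | start_lsn | end_lsn
-----+-----------+-----------
   1 | 0/1000028 | 0/1FFFFD8

-- Inspect one summary; rows with is_limit_block report truncations
-- rather than individual modified blocks.
SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/1FFFFD8')
WHERE NOT is_limit_block
LIMIT 10;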
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
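
[Aside: a minimal sketch of the in-memory API declared above, as a caller
like the walsummarizer might use it. The relation locator values are made
up, and write_callback is whatever io_callback_fn the caller supplies (see
the stdio sketch earlier):]

static void
block_ref_table_example(io_callback_fn write_callback,
                        void *write_callback_arg)
{
    BlockRefTable *brtab = CreateEmptyBlockRefTable();
    RelFileLocator rlocator;

    /* Hypothetical relation in the default tablespace. */
    rlocator.spcOid = DEFAULTTABLESPACE_OID;
    rlocator.dbOid = 5;
    rlocator.relNumber = 16384;

    /* The relation was truncated to 100 blocks, then block 7 changed. */
    BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 100);
    BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 7);

    /* Serialize; entries are sorted into on-disk order internally. */
    WriteBlockRefTable(brtab, write_callback, write_callback_arg);
}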
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1043a4d782..74bc2f97cb 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,6 +336,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -442,6 +443,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -454,6 +456,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..180d3f34b9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact,
+ bool reset_pending_lsn);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
+ XLogRecPtr *pending_lsn);
+
+#endif
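
[Aside: from a user's point of view, the two GUCs exported above are the
whole interface. A hypothetical postgresql.conf fragment, assuming
wal_summary_keep_time accepts the usual time units:]

summarize_wal = on              # run the walsummarizer; required before
                                # any incremental backup can be taken
wal_summary_keep_time = '7d'    # remove summary files older than a week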
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4b25961249..e87fd25d64 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ba41149b88..9390049314 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4012,3 +4012,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.39.3 (Apple Git-145)
[Attachment: v14-0003-Add-support-for-incremental-backup.patch (application/octet-stream)]
From 2508129d8f3f80f252d8a4d0b66f20ff893c5862 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v14 3/5] Add support for incremental backup.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar, Andres Freund, and Álvaro Herrera
for design discussion and reviews, and to Jakub Wartak for incredibly
helpful and extensive testing.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/protocol.sgml | 24 +
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 319 +++-
src/backend/backup/basebackup_incremental.c | 1003 +++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 112 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1284 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
49 files changed, 5822 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest from an earlier backup taken on the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files that contain
+ only the blocks changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the newest summary files are unlikely to be present on disk just
+ yet, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to forgo incremental
+ backups entirely and just take full backups, which are simpler
+ to manage. For a large database that is heavily modified throughout,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
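
[Aside on the documentation above: a concrete end-to-end sketch of a chain
with two incrementals, using hypothetical directory names. Note that the
second incremental is taken against the first incremental's manifest, and
that pg_combinebackup is given the whole chain, oldest first:]

pg_basebackup -cfast -D full
pg_basebackup -cfast -D incr1 --incremental full/backup_manifest
pg_basebackup -cfast -D incr2 --incremental incr1/backup_manifest
pg_combinebackup full incr1 incr2 -o restored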
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee98585027..b5624ca884 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4153,13 +4153,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index af3f016f74..9a66918171 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2599,6 +2599,19 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</listitem>
</varlistentry>
+ <varlistentry id="protocol-replication-upload-manifest">
+ <term>
+ <literal>UPLOAD_MANIFEST</literal>
+ <indexterm><primary>UPLOAD_MANIFEST</primary></indexterm>
+ </term>
+ <listitem>
+ <para>
+ Uploads a backup manifest in preparation for taking an incremental
+ backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP">
<term><literal>BASE_BACKUP</literal> [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
@@ -2838,6 +2851,17 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>INCREMENTAL</literal></term>
+ <listitem>
+ <para>
+ Requests an incremental backup. The
+ <literal>UPLOAD_MANIFEST</literal> command must be executed
+ before running a base backup with this option.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
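
[Aside: at the protocol level, the flow documented above is just two
commands on a replication connection; pg_basebackup issues them for you
when --incremental is given. An illustrative sketch, with the manifest
transfer itself elided:]

UPLOAD_MANIFEST
-- (the manifest contents are then streamed to the server)
BASE_BACKUP ( INCREMENTAL, LABEL 'nightly', PROGRESS )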
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, it replaces some files that would have been part of a full backup
+ with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
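
[Aside: combining the options documented above, a reconstruction that also
relocates a tablespace might look like this; all paths are hypothetical:]

pg_combinebackup full incr1 incr2 -o /srv/pg/restored \
    -T /srv/pg/ts_old=/srv/pg/ts_new \
    --manifest-checksums=SHA256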
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a2c8fa3981..6f4f81f992 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..5ee9628422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -33,6 +35,7 @@
#include "pgtar.h"
#include "port.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
#include "storage/bufpage.h"
@@ -64,6 +67,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +80,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +113,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +232,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +283,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +306,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +357,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +365,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +627,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +703,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +782,19 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ if (opt->incremental && !summarize_wal)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +988,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1012,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1057,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1135,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1169,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1189,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1198,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1240,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1392,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1467,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1542,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1558,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1571,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1581,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1614,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1901,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
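
(Aside, for reviewers: the on-disk layout of an incremental file, as emitted by
the sendFile() changes above, is easiest to see in one place. The struct below
is purely illustrative - it does not exist in the patch, since the header is
pushed out field by field via push_to_sink() - but it shows the byte layout:

    typedef struct
    {
        uint32      magic;          /* INCREMENTAL_MAGIC */
        uint32      num_blocks;     /* how many blocks are included */
        uint32      truncation_block_length;   /* minimum reconstructed
                                                 * length, in blocks */
        BlockNumber blocks[FLEXIBLE_ARRAY_MEMBER];  /* relative block
                                                     * numbers, sorted */
        /* ...followed by num_blocks blocks of BLCKSZ bytes each, in order */
    } IncrementalFileHeader;        /* hypothetical name, for illustration */

With the default 8kB block size, a file carrying 10 modified blocks is
therefore 3 * 4 + 10 * (4 + 8192) = 81,972 bytes; see also
GetIncrementalFileSize() in the next file.)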
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..1e5a5ac33a
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,1003 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+ XLogRecPtr pending_lsn;
+ XLogRecPtr prior_pending_lsn = InvalidXLogRecPtr;
+ int deadcycles = 0;
+ TimestampTz initial_time,
+ current_time;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs at which
+ * this server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ initial_time = current_time = GetCurrentTimestamp();
+ while (1)
+ {
+ long timeout_in_ms = 10000;
+ unsigned elapsed_seconds;
+
+ /*
+ * Align the wait time to prevent drift. This doesn't really matter,
+ * but we'd like the warnings about how long we've been waiting to say
+ * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
+ * drifting to something that is not a multiple of ten.
+ */
+ timeout_in_ms -=
+ TimestampDifferenceMilliseconds(initial_time, current_time) %
+ timeout_in_ms;
+
+ /* Wait for up to 10 seconds. */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
+ timeout_in_ms, &pending_lsn);
+
+ /* If WAL summarization has progressed sufficiently, stop waiting. */
+ if (summarized_lsn >= backup_state->startpoint)
+ break;
+
+ /*
+ * Keep track of the number of cycles during which there has been no
+ * progression of pending_lsn. If pending_lsn is not advancing, that
+ * means that not only are no new files appearing on disk, but we're
+ * not even incorporating new records into the in-memory state.
+ */
+ if (pending_lsn > prior_pending_lsn)
+ {
+ prior_pending_lsn = pending_lsn;
+ deadcycles = 0;
+ }
+ else
+ ++deadcycles;
+
+ /*
+ * If we've managed to wait for an entire minute without the WAL
+ * summarizer absorbing a single WAL record, error out; probably
+ * something is wrong.
+ *
+ * We could consider also erroring out if the summarizer is taking too
+ * long to catch up, but it's not clear what rate of progress would be
+ * acceptable and what would be too slow. So instead, we just try to
+ * error out in the case where there's no progress at all. That seems
+ * likely to catch a reasonable number of the things that can go wrong
+ * in practice (e.g. the summarizer process is completely hung, say
+ * because somebody hooked up a debugger to it or something) without
+ * giving up too quickly when the system is just slow.
+ */
+ if (deadcycles >= 6)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summarization is not progressing"),
+ errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+
+ /*
+ * Otherwise, just let the user know what's happening.
+ */
+ current_time = GetCurrentTimestamp();
+ elapsed_seconds =
+ TimestampDifferenceMilliseconds(initial_time, current_time) / 1000;
+ ereport(WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("still waiting for WAL summarization through %X/%X after %d seconds",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ elapsed_seconds),
+ errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+ }
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
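+
+/*
+ * For example (hypothetical OIDs): for base/13450/16384 this returns
+ * base/13450/INCREMENTAL.16384, and for segment 1 of the same relation
+ * it returns base/13450/INCREMENTAL.16384.1.
+ */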
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE
+ * entries.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * If this file was not part of the prior backup, back it up fully.
+ *
+ * If this file was created after the prior backup and before the start of
+ * the current backup, then the WAL summary information will tell us to
+ * back up the whole file. However, if this file was created after the
+ * start of the current backup, then the WAL summary won't know anything
+ * about it. Without this logic, we would erroneously conclude that it was
+ * OK to send it incrementally.
+ *
+ * Note that the file could have existed at the time of the prior backup,
+ * gotten deleted, and then a new file with the same name could have been
+ * created. In that case, this logic won't prevent the file from being
+ * backed up incrementally. But, if the deletion happened before the start
+ * of the current backup, the limit block will be 0, inducing a full
+ * backup. If the deletion happened after the start of the current backup,
+ * reconstruction will erroneously combine blocks from the current
+ * lifespan of the file with blocks from the previous lifespan -- but in
+ * this type of case, WAL replay to reach backup consistency should remove
+ * and recreate the file anyway, so the initial bogus contents should not
+ * matter.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose absolute block numbers to relative
+ * block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
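(To make the intended call pattern concrete, here is a simplified sketch of
how the basebackup.c changes are expected to consume GetFileBackupMethod().
The snippet itself is not from the patch - the surrounding variables just
mirror the function's signature, and the plumbing is elided:

    BlockNumber relative_block_numbers[RELSEG_SIZE];
    unsigned    num_blocks_required;
    unsigned    truncation_block_length;

    if (GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
                            forknum, segno, size,
                            &num_blocks_required, relative_block_numbers,
                            &truncation_block_length) == BACK_UP_FILE_INCREMENTALLY)
    {
        /* send INCREMENTAL.<name>: header, then the selected blocks */
    }
    else
    {
        /* send the file in full, as in a non-incremental backup */
    }

Note that the array really must have RELSEG_SIZE entries, since the function
can hand back up to that many block numbers.)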
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..dbcda32554 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in CopyOut mode as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0e0ac22bdd..706140eb9f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -32,6 +32,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -140,6 +141,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -337,6 +339,7 @@ CreateOrAttachShmemStructs(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..5795b91261 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,76 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* Reject if server is too old. */
+ if (serverVersion < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ pg_fatal("server does not support incremental backup");
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1997,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2353,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2391,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2416,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2451,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2867,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
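+ * For example, a typical START WAL LOCATION line looks like:
+ *     START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
+ * and only the LSN and the space that follows it are examined here.
+ *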
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and if sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
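+ /* Temporarily NUL-terminate the buffer at e so sscanf() cannot run past it. */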
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..40a55e3087
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
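+ /*
+ * If no OS-specific primitive is available, strategy_name remains NULL
+ * and we fall through to the block-by-block copy below.
+ */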
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
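+ /* a multi-block buffer reduces the number of read() and write() calls */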
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..ad32323c9c
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
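+ *
+ * For example, with this estimate a 10 MB manifest is assumed to contain
+ * roughly 100,000 entries.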
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..63dcbf329d
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1284 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
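+ * The final backup's directory follows the prior ones in this same array,
+ * which is why load_backup_manifests() below can be asked to load all
+ * n_backups manifests starting from prior_backup_dirs.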
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, prior_backup_dirs);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ {
+ mwriter = create_manifest_writer(opt.output);
+
+ /*
+ * Verify that we have a backup manifest for the final backup; else we
+ * won't have the WAL ranges for the resulting manifest.
+ */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ }
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
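+ *
+ * For example, "/old/ts=/new/ts" maps /old/ts to /new/ts, and an escaped
+ * "\=" in either name is copied as a literal equals sign rather than
+ * treated as the separator.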
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
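+ /* check_tli/check_lsn hold the expected start position, taken from the newer backup's INCREMENTAL FROM values. */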
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_info that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ uint64 oid; /* 64 bits so the range check below works on all platforms */
+ char *ep;
+
+ errno = 0;
+ oid = strtou64(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If the directory being processed is a user-defined tablespace, tsoid
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * input_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
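+ *
+ * For example, when processing "base/1" within the main data directory,
+ * manifest_prefix becomes "base/1/".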
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strncpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno != 0)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the second.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
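+ *
+ * As a hypothetical example: with truncation_block_length = 4 and blocks
+ * 1 and 6 present in this file, block 1 leaves only blocks 0, 2, and 3 to
+ * be found in older backups, while block 6 merely extends the output to 7
+ * blocks without changing what must be found.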
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file but
+ * never subsequently doing anything to those blocks that would
+ * have generated WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
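+ *
+ * Note that on a successful reuse, checksum_type is reset to
+ * CHECKSUM_TYPE_NONE below, so that no second checksum is computed when
+ * the file is subsequently copied.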
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
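+ *
+ * A hypothetical example: with truncation_block_length = 10 and blocks 12
+ * and 15 present in the incremental file, the reconstructed length is 16
+ * blocks; any blocks in between that have no source are zero-filled when
+ * the output file is written.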
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
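+ *
+ * For reference, the header layout implied by the reads below is: a magic
+ * number (INCREMENTAL_MAGIC), a count of blocks, a truncation block
+ * length, and an array of relative block numbers, one element per stored
+ * block; the block data itself follows the header.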
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ unsigned rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
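+ /*
+ * Each contiguous run of blocks is rendered as "N-M:source@offset", or
+ * with ":zero" for blocks that will be zero-filled; a hypothetical plan
+ * might read " 0-9:backup1/base/5/16384@0 10:zero".
+ */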
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+	'table contents as expected');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
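+ *
+ * For illustration, an entry as generated below looks roughly like this
+ * (hypothetical values):
+ *
+ * { "Path": "base/5/16384", "Size": 8192,
+ *   "Last-Modified": "2023-11-01 12:00:00 GMT",
+ *   "Checksum-Algorithm": "CRC32C", "Checksum": "aa151b67" }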
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
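+ *
+ * For example, the two bytes 0xaa and 0x15 encode to the four characters
+ * "aa15". The caller must ensure that dst has room for len * 2 characters;
+ * the return value is the number of characters written.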
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
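+ /*
+ * Per the walsummarizer, WAL summary file names look like
+ * ${TLI}${START_LSN}${END_LSN}.summary, i.e. 8 + 16 + 16 = 40 hexadecimal
+ * characters followed by the ".summary" suffix; a made-up example:
+ * 00000001000000000100002800000000010000D8.summary.
+ */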
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif /* BASEBACKUP_INCREMENTAL_H */
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..46cb2a6550 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9390049314..e37ef9aa76 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4023,3 +4023,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.39.3 (Apple Git-145)
Attachment: v14-0004-Add-new-pg_walsummary-tool.patch
From 30d698acb6de507b8c8a5fafda4b8a81f6ca5b5b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v14 4/5] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 477 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ truncated within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7c840c36b3..9fa155349e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -342,7 +342,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e37ef9aa76..86e0a86503 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4035,3 +4035,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
Hi Robert,
On Mon, Dec 11, 2023 at 6:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

While we are at it, maybe around the below in PrepareForIncrementalBackup()

if (tlep[i] == NULL)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("timeline %u found in manifest, but not in this server's history",
                    range->tli)));

we could add

errhint("You might need to start a new full backup instead of an
incremental one")?
I can't exactly say that such a hint would be inaccurate, but I think
the impulse to add it here is misguided. One of my design goals for
this system is to make it so that you never have to take a new
incremental backup "just because,"
Did you mean take a new full backup here?
not even in case of an intervening
timeline switch. So, all of the errors in this function are warning
you that you've done something that you really should not have done.
In this particular case, you've either (1) manually removed the
timeline history file, and not just any timeline history file but the
one for a timeline for a backup that you still intend to use as the
basis for taking an incremental backup or (2) tried to use a full
backup taken from one server as the basis for an incremental backup on
a completely different server that happens to share the same system
identifier, e.g. because you promoted two standbys derived from the
same original primary and then tried to use a full backup taken on one
as the basis for an incremental backup taken on the other.
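To sketch (2) concretely - hypothetical directories and ports, and
assuming the two standbys share a WAL archive so that the second
promotion picks a new timeline:

pg_ctl -D standby1 promote # ends up on timeline 2
pg_ctl -D standby2 promote # ends up on timeline 3
pg_basebackup -c fast -D full -p 6432 # full backup from standby1
pg_basebackup -c fast -D incr -p 6433 --incremental=full/backup_manifest

The manifest references timeline 2, which is nowhere in standby2's
history, so the last command draws this error.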
Okay, but please consider two other possibilities:

(3) I had a corrupted DB that I fixed by running pg_resetwal, and some
cronjob just a day later attempted to take an incremental backup and
failed with that error.

(4) I had pg_upgraded (which calls pg_resetwal on a fresh initdb
directory) the DB, where a cronjob then failed with this error.

I bet that (4) is going to happen more often than (1) or (2), which
might trigger users to complain on forums, support tickets.
I have a fix for this locally, but I'm going to hold off on publishing
a new version until either there's a few more things I can address all
at once, or until Thomas commits the ubsan fix.

Great, I cannot get it to fail again today, it had to be some dirty
state of the testing env. BTW: Thomas has pushed that ubsan fix.

Huzzah, the cfbot likes the patch set now. Here's a new version with
the promised fix for your non-reproducible issue. Let's see whether
you and cfbot still like this version.
LGTM, all quick tests work from my end too. BTW: I have also scheduled
the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
test. I'll let you know how it went.
-J.
On Wed, Dec 13, 2023 at 5:39 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
I can't exactly say that such a hint would be inaccurate, but I think
the impulse to add it here is misguided. One of my design goals for
this system is to make it so that you never have to take a new
incremental backup "just because,"Did you mean take a new full backup here?
Yes, apologies for the typo.
not even in case of an intervening
timeline switch. So, all of the errors in this function are warning
you that you've done something that you really should not have done.
In this particular case, you've either (1) manually removed the
timeline history file, and not just any timeline history file but the
one for a timeline for a backup that you still intend to use as the
basis for taking an incremental backup or (2) tried to use a full
backup taken from one server as the basis for an incremental backup on
a completely different server that happens to share the same system
identifier, e.g. because you promoted two standbys derived from the
same original primary and then tried to use a full backup taken on one
as the basis for an incremental backup taken on the other.

Okay, but please consider two other possibilities:

(3) I had a corrupted DB that I fixed by running pg_resetwal, and some
cronjob just a day later attempted to take an incremental backup and
failed with that error.

(4) I had pg_upgraded (which calls pg_resetwal on a fresh initdb
directory) the DB, where a cronjob then failed with this error.

I bet that (4) is going to happen more often than (1) or (2), which
might trigger users to complain on forums, support tickets.
Hmm. In case (4), I was thinking that you'd get a complaint about the
database system identifier not matching. I'm not actually sure that's
what would happen, though, now that you mention it.
In case (3), I think you would get an error about missing WAL summary files.
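Sketching case (3) with made-up paths:

pg_ctl -D $PGDATA stop
pg_resetwal -D $PGDATA # with this patch set, this also empties pg_wal/summaries
pg_ctl -D $PGDATA start
pg_basebackup -c fast -D incr --incremental=prior/backup_manifest

and I'd expect that last command to fail because the WAL summaries
covering the range back to the prior backup are gone.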
Huzzah, the cfbot likes the patch set now. Here's a new version with
the promised fix for your non-reproducible issue. Let's see whether
you and cfbot still like this version.

LGTM, all quick tests work from my end too. BTW: I have also scheduled
the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
test. I'll let you know how it went.
Awesome, thank you so much.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Robert,
On Wed, Dec 13, 2023 at 2:16 PM Robert Haas <robertmhaas@gmail.com> wrote:
not even in case of an intervening
timeline switch. So, all of the errors in this function are warning
you that you've done something that you really should not have done.
In this particular case, you've either (1) manually removed the
timeline history file, and not just any timeline history file but the
one for a timeline for a backup that you still intend to use as the
basis for taking an incremental backup or (2) tried to use a full
backup taken from one server as the basis for an incremental backup on
a completely different server that happens to share the same system
identifier, e.g. because you promoted two standbys derived from the
same original primary and then tried to use a full backup taken on one
as the basis for an incremental backup taken on the other.

Okay, but please consider two other possibilities:

(3) I had a corrupted DB that I fixed by running pg_resetwal, and some
cronjob just a day later attempted to take an incremental backup and
failed with that error.

(4) I had pg_upgraded (which calls pg_resetwal on a fresh initdb
directory) the DB, where a cronjob then failed with this error.

I bet that (4) is going to happen more often than (1) or (2), which
might trigger users to complain on forums, support tickets.

Hmm. In case (4), I was thinking that you'd get a complaint about the
database system identifier not matching. I'm not actually sure that's
what would happen, though, now that you mention it.
I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
mismatch (of course the DBIDs do differ after initdb), but I get this
instead:
$ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
--incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING: aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR: timeline
2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"
Also, in the manifest I don't see a DBID?
Maybe it's a nuisance, and all I'm trying to say is that if an
automated cronjob with pg_basebackup --incremental hits a freshly
upgraded cluster, that error message without an errhint() is going to
scare some Junior DBAs.
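In other words, the cronjob's log would then show something like this
(the HINT line being the hypothetical addition):

pg_basebackup: error: could not initiate base backup: ERROR: timeline
2 found in manifest, but not in this server's history
HINT: You might need to start a new full backup instead of an
incremental one.

which at least points them in the right direction.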
LGTM, all quick tests work from my end too. BTW: I have also scheduled
the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
test. I'll let you know how it went.

Awesome, thank you so much.
OK, so pgbench -i -s 14440 and pgbench -P 1 -R 100 -c 8 -T 259200 did
generate pretty large incrementals (I had to abort the run for lack of
space; I was expecting the incrementals to be much smaller). I
initially suspected that the problem lies in the uniform distribution
of `\set aid random(1, 100000 * :scale)` in the tpcb-like script,
which UPDATEs the big pgbench_accounts table.
$ du -sm /backups/backups/* /backups/archive/
216205 /backups/backups/full
215207 /backups/backups/incr.1
216706 /backups/backups/incr.2
102273 /backups/archive/
So I verified the recoverability yesterday anyway - the
pg_combinebackup "full incr.1 incr.2" took 44 minutes, and the later
archive WAL recovery and promotion SUCCEEDED. The 8-way parallel
seqscan for sum(abalance) on pgbench_accounts and the other tables
worked fine. The pg_combinebackup was using 15-20% CPU (mostly %sys),
while performing mostly 60-80MB/s separately for both reads and writes
(it's slow, but that's due to the maxed-out sequential I/O of the
small Premium SSD on Azure).
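For reference, the combine step was essentially the usage pattern from
the top of the thread, i.e. something like (output path made up):

pg_combinebackup /backups/backups/full /backups/backups/incr.1
/backups/backups/incr.2 -o /path/to/combined_data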
So I've launched another improved test (to force more localized
UPDATEs) to see a more real-world space-effectiveness of the
incremental backup:
\set aid random_exponential(1, 100000 * :scale, 8)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
    VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
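I ran that with something like `pgbench -n -P 1 -R 100 -c 8 -T 259200
-f exp.sql` - the same load shape as before, just with the exponential
access pattern (the script file name here is made up).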
But then... (and I have verified the low IDs for :aid above) the same
has happened:
backups/backups$ du -sm /backups/backups/*
210229 /backups/backups/full
208299 /backups/backups/incr.1
208351 /backups/backups/incr.2
# pgbench_accounts has relfilenodeid 16486
postgres@jw-test-1:/backups/backups$ for L in 5 10 15 30 100 161 173
174 175 ; do md5sum full/base/5/16486.$L ./incr.1/base/5/16486.$L
./incr.2/base/5/16486.$L /var/lib/postgres/17/data/base/5/16486.$L ;
echo; done
005c6bbb40fca3c1a0a819376ef0e793 full/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 ./incr.1/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 ./incr.2/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 /var/lib/postgres/17/data/base/5/16486.5
[.. all the checksums match (!) for the above $L..]
c5117a213253035da5e5ee8a80c3ee3d full/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d ./incr.1/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d ./incr.2/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d /var/lib/postgres/17/data/base/5/16486.173
47ee6b18d7f8e40352598d194b9a3c8a full/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a ./incr.1/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a ./incr.2/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a /var/lib/postgres/17/data/base/5/16486.174
82dfeba58b4a1031ac12c23f9559a330 full/base/5/16486.175
21a8ac1e6fef3cf0b34546c41d59b2cc ./incr.1/base/5/16486.175
2c3d89c612b2f97d575a55c6c0204d0b ./incr.2/base/5/16486.175
73367d44d76e98276d3a6bbc14bb31f1 /var/lib/postgres/17/data/base/5/16486.175
So to me, it looks like it copied 174 out of 175 files anyway,
lowering the effectiveness of that incremental backup to nearly 0%.

The commands to generate those incremental backups were:
pg_basebackup -v -P -c fast -D /backups/backups/incr.1
--incremental=/backups/backups/full/backup_manifest
sleep 4h
pg_basebackup -v -P -c fast -D /backups/backups/incr.2
--incremental=/backups/backups/incr1/backup_manifest
The incrementals are being generated, but just for the first (0)
segment of the relation?
/backups/backups$ ls -l incr.2/base/5 | grep INCR
-rw------- 1 postgres postgres 12 Dec 14 21:33 INCREMENTAL.112
-rw------- 1 postgres postgres 12 Dec 14 21:01 INCREMENTAL.113
-rw------- 1 postgres postgres 12 Dec 14 21:36 INCREMENTAL.1247
-rw------- 1 postgres postgres 12 Dec 14 21:38 INCREMENTAL.1247_vm
[..note, no INCREMENTAL.$int.$segment files]
-rw------- 1 postgres postgres 12 Dec 14 21:24 INCREMENTAL.6238
-rw------- 1 postgres postgres 12 Dec 14 21:17 INCREMENTAL.6239
-rw------- 1 postgres postgres 12 Dec 14 21:55 INCREMENTAL.827
# 16486 is pgbench_accounts
/backups/backups$ ls -l incr.2/base/5/*16486* | grep INCR
-rw------- 1 postgres postgres 14613480 Dec 14 21:00
incr.2/base/5/INCREMENTAL.16486
-rw------- 1 postgres postgres 12 Dec 14 21:52
incr.2/base/5/INCREMENTAL.16486_vm
/backups/backups$
/backups/backups$ find incr* -name INCREMENTAL.* | wc -l
1342
/backups/backups$ find incr* -name INCREMENTAL.*_* | wc -l # VM or FSM
236
/backups/backups$ find incr* -name INCREMENTAL.*.* | wc -l # not a
single incremental file for any segment of a >1GB relation
0
I'm quickly passing this info along and I haven't really looked at
the code yet, but the problem should be somewhere around
GetFileBackupMethod(), and it should be easily reproducible with the
configure --with-segsize-blocks=X switch.
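An untested sketch, assuming summarize_wal is enabled and using tiny
4-block segments so that multi-segment relations appear right away:

./configure --with-segsize-blocks=4 ...
pgbench -i -s 1 # pgbench_accounts now spans many tiny segments
pg_basebackup -c fast -D full
pgbench -T 30
pg_basebackup -c fast -D incr --incremental=full/backup_manifest
find incr -name 'INCREMENTAL.*.*' # currently empty; should list the segments beyond the first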
-J.
I have a couple of quick fixes here.
The first fixes up some things in nls.mk related to a file move. The
second is some cleanup because some function you are using has been
removed in the meantime; you probably found that yourself while rebasing.
The pg_walsummary patch doesn't have a nls.mk, but you also comment that
it doesn't have tests yet, so I assume it's not considered complete yet
anyway.
Attachments:
0002-fixup-Move-src-bin-pg_verifybackup-parse_man.patch.nocfbot
From 04aae4ee91ddd1d4ce061c36a99b0fa18bdd98ec Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 14 Dec 2023 12:50:33 +0100
Subject: [PATCH 2/6] fixup! Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
---
src/bin/pg_verifybackup/nls.mk | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/bin/pg_verifybackup/nls.mk b/src/bin/pg_verifybackup/nls.mk
index eba73a2c05..9e6a6049ba 100644
--- a/src/bin/pg_verifybackup/nls.mk
+++ b/src/bin/pg_verifybackup/nls.mk
@@ -1,10 +1,10 @@
# src/bin/pg_verifybackup/nls.mk
CATALOG_NAME = pg_verifybackup
GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
- parse_manifest.c \
pg_verifybackup.c \
../../common/fe_memutils.c \
- ../../common/jsonapi.c
+ ../../common/jsonapi.c \
+ ../../common/parse_manifest.c
GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS) \
json_manifest_parse_failure:2 \
error_cb:2 \
--
2.43.0
0004-fixup-Add-a-new-WAL-summarizer-process.patch.nocfbot
From 25211044687a629e632ef0a2bfad30acea337266 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 14 Dec 2023 18:32:29 +0100
Subject: [PATCH 4/6] fixup! Add a new WAL summarizer process.
---
src/backend/backup/meson.build | 2 +-
src/backend/postmaster/walsummarizer.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..5d4ebe3ebe 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -13,5 +13,5 @@ backend_sources += files(
'basebackup_throttle.c',
'basebackup_zstd.c',
'walsummary.c',
- 'walsummaryfuncs.c'
+ 'walsummaryfuncs.c',
)
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7c840c36b3..9fa155349e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -342,7 +342,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
--
2.43.0
A separate bikeshedding topic: The GUC "summarize_wal", could that be
"wal_something" instead? (wal_summarize? wal_summarizer?) It would be
nice if these setting names grouped together a bit, both with existing
wal_* ones and also with the new ones you are adding
(wal_summary_keep_time).
Another set of comments, about the patch that adds pg_combinebackup:
Make sure all the options are listed in a consistent order. We have
lately changed everything to be alphabetical. This includes:
- reference page pg_combinebackup.sgml
- long_options listing
- getopt_long() argument
- subsequent switch
- (--help output, but it looks ok as is)
Also, in pg_combinebackup.sgml, the option --sync-method is listed as if
it does not take an argument, but it does.
On Fri, Dec 15, 2023 at 6:53 AM Peter Eisentraut <peter@eisentraut.org> wrote:
The first fixes up some things in nls.mk related to a file move. The
second is some cleanup because some function you are using has been
removed in the meantime; you probably found that yourself while rebasing.
Incorporated these. As you guessed,
MemoryContextResetAndDeleteChildren -> MemoryContextReset had already
been done locally.
The pg_walsummary patch doesn't have a nls.mk, but you also comment that
it doesn't have tests yet, so I assume it's not considered complete yet
anyway.
I think this was more of a case of me just not realizing that I should
add that. I'll add something simple to the next version, but I'm not
very good at this NLS stuff.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Dec 15, 2023 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote:
A separate bikeshedding topic: The GUC "summarize_wal", could that be
"wal_something" instead? (wal_summarize? wal_summarizer?) It would be
nice if these setting names grouped together a bit, both with existing
wal_* ones and also with the new ones you are adding
(wal_summary_keep_time).
Yeah, this is highly debatable, so bikeshed away. IMHO, the question
here is whether we care more about (1) having the name of the GUC
sound nice grammatically or (2) having the GUC begin with the same
string as other, related GUCs. I think that Tom Lane tends to prefer
the former, and probably some other people do too, while some other
people tend to prefer the latter. Ideally it would be possible to
satisfy both goals at once here, but everything I thought about that
started with "wal" sounded too awkward for me to like it; hence the
current choice of name. But if there's consensus on something else, so
be it.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Dec 18, 2023 at 4:10 AM Peter Eisentraut <peter@eisentraut.org> wrote:
Another set of comments, about the patch that adds pg_combinebackup:
Make sure all the options are listed in a consistent order. We have
lately changed everything to be alphabetical. This includes:

- reference page pg_combinebackup.sgml
- long_options listing
- getopt_long() argument
- subsequent switch
- (--help output, but it looks ok as is)
Also, in pg_combinebackup.sgml, the option --sync-method is listed as if
it does not take an argument, but it does.
I've attempted to clean this stuff up in the attached version. This
version also includes a fix for the bug found by Jakub that caused
things to not work properly for segment files beyond the first for any
particular relation, which turns out to be a really stupid mistake in
my earlier commit 025584a168a4b3002e193.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v15-0001-Fix-brown-paper-bag-bug-in-025584a168a4b3002e193.patch
From 6d78fd0d854425b0442f69350d2f241eb9bc7648 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 18 Dec 2023 13:16:57 -0500
Subject: [PATCH v15 1/6] Fix brown paper bag bug in
025584a168a4b3002e19350bb8db0ebf1fd10235.
The previous logic failed to work for anything other than the first
segment of a relation.
Report by Jakub Wartak. Patch by me.
---
src/backend/storage/file/reinit.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 5df2517b46..6e8eb786d0 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -431,7 +431,7 @@ parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
else
{
/* Reject leading zeroes, just like we do for RelFileNumber. */
- if (name[0] < '1' || name[0] > '9')
+ if (name[1] < '1' || name[1] > '9')
return false;
errno = 0;
--
2.39.3 (Apple Git-145)
v15-0005-Add-new-pg_walsummary-tool.patch
From 4730b37708f41174005e2016c9b0090a1bd33571 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v15 5/6] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/nls.mk | 6 +
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
12 files changed, 483 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/nls.mk
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ destroyed within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7c840c36b3..9fa155349e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -342,7 +342,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/nls.mk b/src/bin/pg_walsummary/nls.mk
new file mode 100644
index 0000000000..f411dcfe9e
--- /dev/null
+++ b/src/bin/pg_walsummary/nls.mk
@@ -0,0 +1,6 @@
+# src/bin/pg_walsummary/nls.mk
+CATALOG_NAME = pg_walsummary
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ pg_walsummary.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
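To make the growth loop above concrete: assuming a (hypothetical) initial
block_buffer_size of 512, a relation with 1300 modified blocks first fills all
512 slots; the buffer doubles to 1024 and the next read returns another 512,
filling it again; it doubles to 2048, the final read returns the remaining
276, and the loop exits with nblocks = 1300 < 2048.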
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
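For anyone who wants to poke at the output by hand, a usage sketch (the
summary filename is invented; real names encode the TLI plus the start and end
LSNs as 40 hex digits, as described in the summarizer patch below):

    pg_walsummary -i $PGDATA/pg_wal/summaries/0000000100000000010000280000000001515788.summary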
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e37ef9aa76..86e0a86503 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4035,3 +4035,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
Attachment: v15-0003-Add-a-new-WAL-summarizer-process.patch (application/octet-stream)
From d88014cd2be3ace022511bf715f31f05de503c74 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v15 3/6] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
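To make the limit-block rule concrete, here is a toy sketch (not code from the
patch; the struct and helper are hypothetical, only the rule itself is taken
from the description above):

    typedef struct
    {
        bool        created_or_destroyed;   /* created/destroyed in range? */
        bool        truncated;              /* truncated in range? */
        BlockNumber shortest_truncation;    /* shortest truncation length */
    } rel_summary_state;        /* hypothetical, for illustration only */

    static BlockNumber
    limit_block_for(rel_summary_state *s)
    {
        if (s->created_or_destroyed)
            return 0;                   /* copy the whole file */
        if (s->truncated)
            return s->shortest_truncation;
        return InvalidBlockNumber;      /* no truncation; no limit */
    }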
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1398 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 33 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3743 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 44cada2b40..ee98585027 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4150,6 +4150,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
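For reference, a configuration sketch; whether wal_summary_keep_time accepts
unit suffixes depends on how the GUC is registered, which this excerpt doesn't
show, so the value below is illustrative:

    # postgresql.conf
    summarize_wal = on              # start the walsummarizer
    wal_summary_keep_time = '10d'   # retain summaries for ten days (default)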
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 01e0484584..421a016ca1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3589,6 +3590,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3869,8 +3907,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3913,6 +3951,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5237,9 +5295,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6956,6 +7014,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7630,6 +7707,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL, false);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..5d4ebe3ebe 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c',
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
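Since summary filenames pack the TLI and LSN bounds into five 8-hex-digit
fields, decoding one by hand mirrors the sscanf above (the filename here is
invented for illustration):

    uint32      tmp[5];
    TimeLineID  tli;
    XLogRecPtr  start_lsn;
    XLogRecPtr  end_lsn;

    sscanf("0000000100000000010000280000000001515788",
           "%08X%08X%08X%08X%08X",
           &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
    tli = tmp[0];                                   /* TLI 1 */
    start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];   /* 0/1000028 */
    end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];     /* 0/1515788 */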
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
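Once the pg_proc.dat entries from this patch are installed, these functions
can be exercised from SQL; the output columns are assumptions based on the
code here, since the catalog entries aren't shown in this excerpt:

    SELECT * FROM pg_available_wal_summaries();
    SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/1515788');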
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index bae6f68c40..5f244216a6 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -158,6 +162,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 651b85ea74..b163e89cbb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -113,6 +113,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -250,6 +251,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -441,6 +443,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -564,6 +567,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -933,6 +937,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1835,6 +1842,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2659,6 +2669,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3012,6 +3024,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3130,6 +3143,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3525,6 +3552,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3675,6 +3708,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3701,6 +3736,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3798,6 +3834,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4019,6 +4056,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5326,6 +5365,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5462,6 +5505,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..7c840c36b3
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1398 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at
+ * which the next summary file will start. Normally, these are the LSN and
+ * TLI at which the last file ended; in such case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but seems
+ * reasonable to treat like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory, and ask
+ * GetOldestUnsummarizedLSN to reset pending_lsn to summarized_lsn. We
+ * might be recovering from an error, and if so, pending_lsn might have
+ * advanced past summarized_lsn, but any WAL we read previously has been
+ * lost and will need to be reread.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact, true);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the LSN is
+ * necessarily the start of a WAL record and false if it's just the beginning
+ * of a WAL segment.
+ *
+ * If reset_pending_lsn is true, resets the pending_lsn in shared memory to
+ * be equal to the summarized_lsn.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
+ bool reset_pending_lsn)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Unless we need to reset the pending_lsn, we initially acquire the lock
+ * in shared mode and try to fetch the required information. If we acquire
+ * in shared mode and find that the data structure hasn't been
+ * initialized, we reacquire the lock in exclusive mode so that we can
+ * initialize it. However, if someone else does that first before we get
+ * the lock, then we can just return the requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ if (reset_pending_lsn)
+ WalSummarizerCtl->pending_lsn =
+ WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %u", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ *
+ * Either way, *pending_lsn is set to the value taken from WalSummarizerCtl.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ *pending_lsn = WalSummarizerCtl->pending_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
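For context, a sketch of how a caller (such as the incremental backup code in
a later patch in this series) might use this; backup_start_lsn and the timeout
are illustrative:

    XLogRecPtr  pending_lsn;
    XLogRecPtr  summarized_lsn;

    summarized_lsn = WaitForWalSummarization(backup_start_lsn, 60000,
                                             &pending_lsn);
    if (summarized_lsn < backup_start_lsn)
        ereport(ERROR,
                errmsg("timed out waiting for WAL summarization"),
                errdetail("Summarization has reached %X/%X, pending up to %X/%X.",
                          LSN_FORMAT_ARGS(summarized_lsn),
                          LSN_FORMAT_ARGS(pending_lsn)));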
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to reading from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove any WAL summary files that are older than wal_summary_keep_time,
+ * doing this work at most once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
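+	 *
+	 * wal_summary_keep_time is measured in minutes (GUC_UNIT_MIN), so, for
+	 * example, the default of 10 days works out to a cutoff of
+	 * 14400 * 60 = 864000 seconds before the current time.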
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the
+ * summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index d99ecdd4d8..0dd9b98b3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -306,7 +306,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -328,6 +329,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 819936ec02..5c9b6f991e 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -305,6 +305,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f7c9882f7c..9f59440526 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -703,6 +704,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1786,6 +1789,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3200,6 +3213,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf9f283cfe..b2809c711a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
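+ *
+ * As a worked example (illustrative only, not part of the design itself):
+ * block 70000 falls in chunk 70000 / BLOCKS_PER_CHUNK = 1, at offset
+ * 70000 % BLOCKS_PER_CHUNK = 4464. In array form, that chunk stores the
+ * uint16 value 4464; in bitmap form, bit 4464 % 16 = 0 of word
+ * 4464 / 16 = 279 within the chunk is set.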
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
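+
+/*
+ * Sketch of the resulting file layout, as implemented by the reader and
+ * writer code below: a 4-byte magic number (BLOCKREFTABLE_MAGIC); then,
+ * for each relation fork in sorted order, a BlockRefTableSerializedEntry,
+ * the truncated chunk-usage array (nchunks uint16s), and the used portion
+ * of each nonempty chunk; then an all-zeroes sentinel entry; and finally a
+ * 4-byte CRC-32C covering everything that precedes it.
+ */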
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * table reference file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
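+ *
+ * For example, truncating a fork to 100 blocks sets the limit block to 100
+ * and forgets any recorded modifications to blocks 100 and above; if block
+ * 150 is later marked modified again, that implies the relation was
+ * re-extended in the meantime.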
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
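+ *
+ * A caller scanning a whole fork in batches (a sketch of hypothetical
+ * caller code, not from this patch) might loop like this:
+ *
+ *     BlockNumber blocks[16];
+ *     BlockNumber start = 0;
+ *     int n;
+ *
+ *     while ((n = BlockRefTableEntryGetBlocks(entry, start,
+ *                                             InvalidBlockNumber,
+ *                                             blocks, 16)) > 0)
+ *     {
+ *         // ... consume blocks[0 .. n-1] ...
+ *         start = blocks[n - 1] + 1;
+ *     }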
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
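+ *
+ * A typical read loop (a sketch of a hypothetical caller, not code from
+ * this patch; process_modified_blocks is imaginary) looks like:
+ *
+ *     RelFileLocator rlocator;
+ *     ForkNumber forknum;
+ *     BlockNumber limit_block;
+ *     BlockNumber blocks[256];
+ *     unsigned nblocks;
+ *
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *                                            &limit_block))
+ *     {
+ *         while ((nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ *                                                        lengthof(blocks))) > 0)
+ *             process_modified_blocks(&rlocator, forknum, limit_block,
+ *                                     blocks, nblocks);
+ *     }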
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
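+ *
+ * The expected calling pattern, sketched with a hypothetical source of
+ * pre-sorted entries (my_write_callback, my_state, and
+ * next_entry_in_sorted_order are imaginary):
+ *
+ *     writer = CreateBlockRefTableWriter(my_write_callback, &my_state);
+ *     while ((entry = next_entry_in_sorted_order()) != NULL)
+ *     {
+ *         BlockRefTableWriteEntry(writer, entry);
+ *         BlockRefTableFreeEntry(entry);
+ *     }
+ *     DestroyBlockRefTableWriter(writer);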
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be one or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
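As an aside for anyone reading the code above: the per-chunk encoding is easiest to see with concrete numbers. The following standalone sketch is not part of the patch, and the constant values are my reading of the top of blkreftable.c (not shown in this excerpt), so treat them as assumptions: each chunk covers 2^16 block numbers, each bitmap word holds 16 bits, and a chunk is promoted from an offset array to a bitmap once the array would cost as much memory as the bitmap.

/*
 * Standalone illustration of the chunk encoding; not part of the patch.
 * Constant values are assumptions based on my reading of blkreftable.c.
 */
#include <stdint.h>
#include <stdio.h>

#define BLOCKS_PER_CHUNK (1 << 16)	/* assumed */
#define BLOCKS_PER_ENTRY (8 * (unsigned) sizeof(uint16_t))	/* 16 bits */
#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)

int
main(void)
{
    unsigned blknum = 200000;	/* arbitrary example block number */
    unsigned chunkno = blknum / BLOCKS_PER_CHUNK;
    unsigned chunkoffset = blknum % BLOCKS_PER_CHUNK;
    uint16_t word = 0;

    /* Array format stores chunkoffset itself as one uint16. */
    printf("block %u -> chunk %u, offset %u\n",
           blknum, chunkno, chunkoffset);

    /* Bitmap format stores the same fact as a single bit. */
    word |= 1 << (chunkoffset % BLOCKS_PER_ENTRY);
    printf("bitmap word %u, mask 0x%04x\n",
           chunkoffset / BLOCKS_PER_ENTRY, word);

    /*
     * An array of MAX_ENTRIES_PER_CHUNK uint16 offsets occupies exactly as
     * many bytes as the full bitmap, which is why the code above converts
     * to a bitmap when the array reaches that size.
     */
    printf("conversion threshold: %u entries\n", MAX_ENTRIES_PER_CHUNK);
    return 0;
}

With these values, block 200000 lands in chunk 3 at offset 3392, and the array-to-bitmap conversion happens at 4096 entries per chunk.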
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
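The WalSummaryIO struct and the ReadWalSummary/WriteWalSummary callbacks are shaped to plug directly into the block reference table reader and writer declared in common/blkreftable.h (added later in this patch). As a rough sketch of the intended call sequence on the read side, with error handling elided and no claim that this is exactly what the patch's own callers do:

#include "backup/walsummary.h"
#include "common/blkreftable.h"
#include "storage/fd.h"

/* Sketch only: dump the contents of one WAL summary file. */
static void
dump_wal_summary(WalSummaryFile *ws)
{
    WalSummaryIO io;
    BlockRefTableReader *reader;
    RelFileLocator rlocator;
    ForkNumber forknum;
    BlockNumber limit_block;
    BlockNumber blocks[256];

    io.file = OpenWalSummaryFile(ws, false);
    io.filepos = 0;

    /* ReadWalSummary has the io_callback_fn shape expected here. */
    reader = CreateBlockRefTableReader(ReadWalSummary, &io,
                                       FilePathName(io.file),
                                       ReportWalSummaryError, NULL);

    while (BlockRefTableReaderNextRelation(reader, &rlocator,
                                           &forknum, &limit_block))
    {
        unsigned n;

        /* Fetch the modified block numbers for this fork in batches. */
        while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
                                                 lengthof(blocks))) > 0)
        {
            /* ... consume blocks[0 .. n-1] ... */
        }
    }

    DestroyBlockRefTableReader(reader);
    FileClose(io.file);
}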
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 77e8b13764..916c8ec8d0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12099,4 +12099,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
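To make the incremental-write contract above concrete, here is a sketch of how I'd expect a caller to emit a single sorted entry, pairing this API with WriteWalSummary from walsummary.h. The fork and block number are arbitrary examples and error handling is elided; this is an illustration, not code from the patch:

#include "backup/walsummary.h"
#include "common/blkreftable.h"

/* Sketch only: write one entry to an already-opened summary file. */
static void
write_one_entry(WalSummaryIO *io, RelFileLocator rlocator)
{
    BlockRefTableWriter *writer;
    BlockRefTableEntry *entry;

    /* WriteWalSummary has the io_callback_fn shape expected here. */
    writer = CreateBlockRefTableWriter(WriteWalSummary, io);

    /* Entries must be supplied in sorted order; here there is just one. */
    entry = CreateBlockRefTableEntry(rlocator, MAIN_FORKNUM);
    BlockRefTableEntrySetLimitBlock(entry, 0);	/* e.g. fork created */
    BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 42);
    BlockRefTableWriteEntry(writer, entry);
    BlockRefTableFreeEntry(entry);

    /* Emits the terminating sentinel entry and the CRC. */
    DestroyBlockRefTableWriter(writer);
}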
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1043a4d782..74bc2f97cb 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,6 +336,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -442,6 +443,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -454,6 +456,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..180d3f34b9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact,
+ bool reset_pending_lsn);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
+ XLogRecPtr *pending_lsn);
+
+#endif
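For what it's worth, here is how I imagine a backup path consuming this interface while waiting for the summarizer to catch up to the backup start LSN. This is a sketch under stated assumptions, not code from the patch: I'm assuming the timeout is in milliseconds and that the return value is the LSN up to which WAL has been summarized so far.

#include "postgres.h"
#include "postmaster/walsummarizer.h"

/* Sketch only: block (up to an assumed 60s) until summaries reach lsn. */
static bool
summaries_have_reached(XLogRecPtr lsn)
{
    XLogRecPtr summarized;
    XLogRecPtr pending;

    summarized = WaitForWalSummarization(lsn, 60 * 1000L, &pending);

    /* pending presumably reports how far the summarizer has read. */
    return summarized >= lsn;
}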
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4b25961249..e87fd25d64 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ba41149b88..9390049314 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4012,3 +4012,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.39.3 (Apple Git-145)
Attachment: v15-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch (application/octet-stream)
From 4d04e3756bea8b2422f161e4456defe338a2d0ad Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v15 2/6] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/nls.mk | 4 ++--
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
8 files changed, 8 insertions(+), 8 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/nls.mk b/src/bin/pg_verifybackup/nls.mk
index eba73a2c05..9e6a6049ba 100644
--- a/src/bin/pg_verifybackup/nls.mk
+++ b/src/bin/pg_verifybackup/nls.mk
@@ -1,10 +1,10 @@
# src/bin/pg_verifybackup/nls.mk
CATALOG_NAME = pg_verifybackup
GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
- parse_manifest.c \
pg_verifybackup.c \
../../common/fe_memutils.c \
- ../../common/jsonapi.c
+ ../../common/jsonapi.c \
+ ../../common/parse_manifest.c
GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS) \
json_manifest_parse_failure:2 \
error_cb:2 \
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.39.3 (Apple Git-145)
Attachment: v15-0004-Add-support-for-incremental-backup.patch (application/octet-stream)
From 7b1c4c8af3182e15e6c4f31db3bd64ba3862d69a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v15 4/6] Add support for incremental backup.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar, Andres Freund, and Álvaro Herrera
for design discussion and reviews, and to Jakub Wartak for incredibly
helpful and extensive testing.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/protocol.sgml | 24 +
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 240 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 319 +++-
src/backend/backup/basebackup_incremental.c | 1003 +++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 112 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1284 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
49 files changed, 5834 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and simply take full backups, which are simpler
+ to manage. For a large database, all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee98585027..b5624ca884 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4153,13 +4153,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index af3f016f74..9a66918171 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2599,6 +2599,19 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</listitem>
</varlistentry>
+ <varlistentry id="protocol-replication-upload-manifest">
+ <term>
+ <literal>UPLOAD_MANIFEST</literal>
+ <indexterm><primary>UPLOAD_MANIFEST</primary></indexterm>
+ </term>
+ <listitem>
+ <para>
+ Uploads a backup manifest in preparation for taking an incremental
+ backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP">
<term><literal>BASE_BACKUP</literal> [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
@@ -2838,6 +2851,17 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>INCREMENTAL</literal></term>
+ <listitem>
+ <para>
+ Requests an incremental backup. The
+ <literal>UPLOAD_MANIFEST</literal> command must be executed
+ before running a base backup with this option.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
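(Putting the two pieces together, the wire-level sequence for an incremental backup is, roughly sketched: UPLOAD_MANIFEST, followed by the contents of the prior backup's manifest, and then BASE_BACKUP ( INCREMENTAL, ... ). pg_basebackup issues this sequence for you when --incremental is given, so few people should need to speak it by hand.)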
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..e1729671a5
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,240 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove the one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-n</option></term>
+ <term><option>--dry-run</option></term>
+ <listitem>
+ <para>
+ The <option>-n</option>/<option>--dry-run</option> option instructs
+ <command>pg_combinebackup</command> to figure out what would be done
+ without actually creating the target directory or any output files.
+ It is particularly useful in combination with <option>--debug</option>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method=<replaceable class="parameter">method</replaceable></option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and
+ exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command
+ line arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
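(So, concretely, an incremental backup's backup_label gains two extra lines in this format, with invented LSN and TLI values here for illustration:

INCREMENTAL FROM LSN: 0/4000028
INCREMENTAL FROM TLI: 1

and the read_backup_label() change below uses the first of them to refuse to run recovery directly on an incremental backup.)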
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a2c8fa3981..6f4f81f992 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..5ee9628422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -33,6 +35,7 @@
#include "pgtar.h"
#include "port.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
#include "storage/bufpage.h"
@@ -64,6 +67,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +80,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +113,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +232,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +283,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +306,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +357,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +365,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +627,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +703,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +782,20 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ if (opt->incremental && !summarize_wal)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +988,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1012,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1057,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1135,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1169,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1189,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1198,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1240,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1392,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1467,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1542,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1558,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1571,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1581,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1614,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
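+ *
+ * The header layout (which GetIncrementalFileSize() mirrors) is a uint32
+ * magic number, a uint32 count of incremental blocks, and a uint32
+ * truncation block length, followed by one BlockNumber per included
+ * block; the block contents themselves come after the header.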
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1901,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
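To make the push_to_sink() contract concrete, here's a minimal sketch of
a hypothetical caller - the function name is invented, but the flush
protocol mirrors what sendFile() does for the incremental header:

static void
emit_example_header(bbsink *sink, pg_checksum_context *ctx)
{
	uint32		magic = INCREMENTAL_MAGIC;
	uint32		nblocks = 0;
	size_t		done = 0;

	/* Copy small items into the sink's buffer, flushing when it fills. */
	push_to_sink(sink, ctx, &done, &magic, sizeof(magic));
	push_to_sink(sink, ctx, &done, &nblocks, sizeof(nblocks));

	/* push_to_sink() flushes only full buffers; the caller flushes the tail. */
	if (done > 0)
	{
		bbsink_archive_contents(sink, done);
		if (pg_checksum_update(ctx, (uint8 *) sink->bbs_buffer, done) < 0)
			elog(ERROR, "could not update checksum");
	}
}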
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..1e5a5ac33a
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,1003 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
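+
+/*
+ * Instantiate a simplehash table keyed by file pathname. This generates
+ * the backup_file_create, backup_file_insert, and backup_file_lookup
+ * functions used below.
+ */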
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
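+ *
+ * (For scale: at roughly 1 bit per 8kB block, even a terabyte of heavily
+ * modified relation data should need only on the order of 16MB here.)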
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+ XLogRecPtr pending_lsn;
+ XLogRecPtr prior_pending_lsn = InvalidXLogRecPtr;
+ int deadcycles = 0;
+ TimestampTz initial_time,
+ current_time;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
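+ *
+ * For example, if this server's history is TLI 1 -> 2 -> 3 and the
+ * manifest contains WAL ranges for TLIs 1 and 2, readTimeLineHistory
+ * returns entries ordered 3, 2, 1, so we end up with
+ * latest_wal_range_tli = 2 and earliest_wal_range_tli = 1.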
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ initial_time = current_time = GetCurrentTimestamp();
+ while (1)
+ {
+ long timeout_in_ms = 10000;
+ unsigned elapsed_seconds;
+
+ /*
+ * Align the wait time to prevent drift. This doesn't really matter,
+ * but we'd like the warnings about how long we've been waiting to say
+ * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
+ * drifting to something that is not a multiple of ten.
+ */
+ timeout_in_ms -=
+ TimestampDifferenceMilliseconds(initial_time, current_time) %
+ timeout_in_ms;
+
+ /* Wait for up to 10 seconds. */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
+ timeout_in_ms, &pending_lsn);
+
+ /* If WAL summarization has progressed sufficiently, stop waiting. */
+ if (summarized_lsn >= backup_state->startpoint)
+ break;
+
+ /*
+ * Keep track of the number of cycles during which there has been no
+ * progression of pending_lsn. If pending_lsn is not advancing, that
+ * means that not only are no new files appearing on disk, but we're
+ * not even incorporating new records into the in-memory state.
+ */
+ if (pending_lsn > prior_pending_lsn)
+ {
+ prior_pending_lsn = pending_lsn;
+ deadcycles = 0;
+ }
+ else
+ ++deadcycles;
+
+ /*
+ * If we've managed to wait for an entire minute without the WAL
+ * summarizer absorbing a single WAL record, error out; probably
+ * something is wrong.
+ *
+ * We could consider also erroring out if the summarizer is taking too
+ * long to catch up, but it's not clear what rate of progress would be
+ * acceptable and what would be too slow. So instead, we just try to
+ * error out in the case where there's no progress at all. That seems
+ * likely to catch a reasonable number of the things that can go wrong
+ * in practice (e.g. the summarizer process is completely hung, say
+ * because somebody hooked up a debugger to it or something) without
+ * giving up too quickly when the system is just slow.
+ */
+ if (deadcycles >= 6)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summarization is not progressing"),
+ errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+
+ /*
+ * Otherwise, just let the user know what's happening.
+ */
+ current_time = GetCurrentTimestamp();
+ elapsed_seconds =
+ TimestampDifferenceMilliseconds(initial_time, current_time) / 1000;
+ ereport(WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("still waiting for WAL summarization through %X/%X after %d seconds",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ elapsed_seconds),
+ errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+ }
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
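+ *
+ * For example, "base/5/16384.1" maps to "base/5/INCREMENTAL.16384.1",
+ * and "base/5/16384" (segment 0) maps to "base/5/INCREMENTAL.16384".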
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * If this file was not part of the prior backup, back it up fully.
+ *
+ * If this file was created after the prior backup and before the start of
+ * the current backup, then the WAL summary information will tell us to
+ * back up the whole file. However, if this file was created after the
+ * start of the current backup, then the WAL summary won't know anything
+ * about it. Without this logic, we would erroneously conclude that it was
+ * OK to send it incrementally.
+ *
+ * Note that the file could have existed at the time of the prior backup,
+ * gotten deleted, and then a new file with the same name could have been
+ * created. In that case, this logic won't prevent the file from being
+ * backed up incrementally. But, if the deletion happened before the start
+ * of the current backup, the limit block will be 0, inducing a full
+ * backup. If the deletion happened after the start of the current backup,
+ * reconstruction will erroneously combine blocks from the current
+ * lifespan of the file with blocks from the previous lifespan -- but in
+ * this type of case, WAL replay to reach backup consistency should remove
+ * and recreate the file anyway, so the initial bogus contents should not
+ * matter.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
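+ *
+ * For example, with BLCKSZ = 8192, a 1000-block segment with 950
+ * modified blocks has 950 * 8192 > 0.9 * (1000 * 8192), so it would be
+ * backed up fully.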
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose them to relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
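As a quick cross-check of GetIncrementalFileSize(), here's a sketch
assuming the default BLCKSZ of 8192 and 4-byte block numbers (not part
of the patch):

static void
check_incremental_file_size(void)
{
	/* Header only: magic + block count + truncation block length. */
	Assert(GetIncrementalFileSize(0) == 12);

	/* Three blocks: 12 + 3 * (4 + 8192) = 24600 bytes. */
	Assert(GetIncrementalFileSize(3) == 24600);
}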
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 5d4ebe3ebe..2a6a2dc7c0 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
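For reference, here's a hedged sketch of how a client might drive the new
command over a replication connection using libpq - error handling is
omitted, the manifest bytes are assumed to already be in memory, and the
BASE_BACKUP option spelling just follows the generic option syntax:

PGresult   *res = PQexec(conn, "UPLOAD_MANIFEST");

if (PQresultStatus(res) == PGRES_COPY_IN)
{
	/* One or more CopyData messages, then CopyDone. */
	PQputCopyData(conn, manifest_data, manifest_len);
	PQputCopyEnd(conn, NULL);
	PQclear(PQgetResult(conn));
}
PQclear(res);

/* The walsender now retains the manifest for this connection. */
res = PQexec(conn, "BASE_BACKUP (INCREMENTAL)");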
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..dbcda32554 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these, as we do elsewhere in COPY handling. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0e0ac22bdd..706140eb9f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -32,6 +32,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -140,6 +141,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -337,6 +339,7 @@ CreateOrAttachShmemStructs(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..5795b91261 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,76 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* Reject if server is too old. */
+ if (serverVersion < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ pg_fatal("server does not support incremental backup");
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1997,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2353,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2391,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2416,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2451,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2867,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
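+ /* bits: 1 = start LSN, 2 = start TLI, 4 = previous LSN, 8 = previous TLI */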
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
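+ /* Use the same temporary NUL-termination trick as parse_lsn(). */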
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..40a55e3087
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
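+ /* arbitrary buffer size, big enough to keep per-syscall overhead low */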
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..ad32323c9c
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex characters, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produces an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..85d3f4e5de
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1284 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "dnNo:T:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, prior_backup_dirs);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ {
+ mwriter = create_manifest_writer(opt.output);
+
+ /*
+ * Verify that we have a backup manifest for the final backup; else we
+ * won't have the WAL ranges for the resulting manifest.
+ */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ }
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version * 10000, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
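+ /*
+ * As we walk the chain backwards, check_tli and check_lsn hold the
+ * expected values for the next-older backup's START TIMELINE and
+ * START WAL LOCATION lines.
+ */
+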
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ uint64 oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || *s == '0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * input_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling, which happens elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
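+ * For example, an input file named "INCREMENTAL.16384" will be
+ * reconstructed and written out as "16384".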
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ /* Copy the payload so that the pfree below remains safe. */
+ checksum_length = mfile->checksum_length;
+ checksum_payload = pg_malloc(checksum_length);
+ memcpy(checksum_payload, mfile->checksum_payload,
+ checksum_length);
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno != 0)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", ifulldir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (E.g., if PG_VERSION contains "14\n", this function
+ * will return 140000.)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used a multi-part version number (e.g. 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno != 0)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
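+ * They are updated as blocks are read during reconstruction and are used
+ * afterward by debug_reconstruction() for logging and dry-run sanity
+ * checks.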
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
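+ *
+ * Slot 0 of this array is the oldest backup in the chain, and slot
+ * n_prior_backups is the newest, i.e. the incremental file that we have
+ * been asked to reconstruct.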
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
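+
+ /*
+ * For example, if block 3 of the output file is stored as the i'th block
+ * of some incremental source file f, we will end up with
+ * sourcemap[3] == f and offsetmap[3] == f->header_length + i * BLCKSZ;
+ * for a full source file, the offset is simply 3 * BLCKSZ.
+ */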
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without ever
+ * taking any action on those blocks that would have generated WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
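+ *
+ * The header format, as read below, is a uint32 magic number
+ * (INCREMENTAL_MAGIC), an unsigned block count, an unsigned truncation
+ * block length, and then one BlockNumber for each stored block; the
+ * block data itself follows the header.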
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != (ssize_t) length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, (int) length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block, s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
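+ /*
+ * The output blocks are written strictly in order, so plain write() calls
+ * suffice; only the reads from the source files need to be positioned,
+ * which is why pg_pread() is used below.
+ */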
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source file. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
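+# Row (3, 'crab') should be missing, since it was inserted on node1 only
+# after node2 had already been created from backup2.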
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+ 'node3 contains the expected rows');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
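To make the intended call sequence concrete, here is a minimal sketch
of how a frontend tool might drive this API (the metadata variables
are hypothetical placeholders; only the three functions and their
signatures come from the header above):

    manifest_writer *mwriter = create_manifest_writer(output_directory);

    /* Once per file included in the synthetic full backup. */
    add_file_to_manifest(mwriter, "base/5/16384", file_size, file_mtime,
                         CHECKSUM_TYPE_SHA256, checksum_length,
                         checksum_payload);

    /* Emit the WAL-Ranges list and manifest checksum, then close. */
    finalize_manifest(mwriter, first_wal_range);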
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
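As a rough illustration of how the server side is expected to consult
this API for each relation segment (the variable names are placeholders,
not part of the patch):

    unsigned num_blocks;
    unsigned truncation_block_length;
    FileBackupMethod method;

    method = GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
                                 forknum, segno, size,
                                 &num_blocks, relative_block_numbers,
                                 &truncation_block_length);
    if (method == BACK_UP_FILE_INCREMENTALLY)
        /* send GetIncrementalFileSize(num_blocks) bytes instead */ ;
    else
        /* BACK_UP_FILE_FULLY: send the file as in a full backup */ ;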
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..46cb2a6550 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9390049314..e37ef9aa76 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4023,3 +4023,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.39.3 (Apple Git-145)
v15-0006-Test-patch-Enable-summarize_wal-by-default.patch
From 962477f1009c04d3e6fe857b5341ec36b5466b95 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v15 6/6] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal=on when wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b163e89cbb..51dc517710 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -937,9 +937,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9fa155349e..71025b43b7 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9f59440526..f249a9fad5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1795,7 +1795,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.39.3 (Apple Git-145)
On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
mismatch (of course they do differ after initdb), but I get this
instead:

$ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
--incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING: aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR: timeline
2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"

Also, in the manifest I don't see the DBID?

Maybe it's a nuisance, and all I'm trying to say is that if an
automated cronjob with pg_basebackup --incremental hits a freshly
upgraded cluster, that error message without an errhint() is going to
scare some junior DBAs.
Yeah. I think we should add the system identifier to the manifest, but
I think that should be left for a future project, as I don't think the
lack of it is a good reason to stop all progress here. When we have
that, we can give more reliable error messages about system mismatches
at an earlier stage. Unfortunately, I don't think that the timeline
messages you're seeing here are going to apply in every case: suppose
you have two unrelated servers that are both on timeline 1. I think
you could use a base backup from one of those servers and use it as
the basis for the incremental from the other, and I think that if you
did it right you might fail to hit any sanity check that would block
that. pg_combinebackup will realize there's a problem, because it has
the whole cluster to work with, not just the manifest, and will notice
the mismatching system identifiers, but that's kind of late to find
out that you made a big mistake. However, right now, it's the best we
can do.
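For illustration, if a future manifest version did carry the system
identifier, the header emitted by create_manifest_writer might look
something like this (hypothetical; the current format has no such key):

    { "PostgreSQL-Backup-Manifest-Version": 1,
      "System-Identifier": 7313866241234567890,
      "Files": [ ... ] }

With that in place, pg_basebackup could compare the value against the
server before streaming anything, instead of relying on timeline sanity
checks or on pg_combinebackup noticing the mismatch much later.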
The incrementals are being generated, but just for the first (0)
segment of the relation?
I committed the first two patches from the series I posted yesterday.
The first should fix this, and the second relocates parse_manifest.c.
That patch hasn't changed in a while and seems unlikely to attract
major objections. There's no real reason to commit it until we're
ready to move forward with the main patches, but I think we're very
close to that now, so I did.
Here's a rebase for cfbot.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v16-0003-Add-new-pg_walsummary-tool.patch
From fc964ce88549e3d44f6641496569a6a67145b5b3 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v16 3/4] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/nls.mk | 6 +
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
12 files changed, 483 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/nls.mk
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ destroyed within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7c840c36b3..9fa155349e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -342,7 +342,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/nls.mk b/src/bin/pg_walsummary/nls.mk
new file mode 100644
index 0000000000..f411dcfe9e
--- /dev/null
+++ b/src/bin/pg_walsummary/nls.mk
@@ -0,0 +1,6 @@
+# src/bin/pg_walsummary/nls.mk
+CATALOG_NAME = pg_walsummary
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ pg_walsummary.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e37ef9aa76..86e0a86503 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4035,3 +4035,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
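To give a feel for the tool's output, a summary range in which a
relation fork was truncated to 100 blocks and then had its first
thousand blocks modified would print something along these lines (all
values made up, following the printf formats in dump_one_relation
above; the "limit" line appears only when the fork was created or
truncated within the range):

    TS 1663, DB 5, REL 16384, FORK main: limit 100
    TS 1663, DB 5, REL 16384, FORK main: blocks 0..999

With --individual, the second line would instead expand into one
"block N" line per modified block.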
v16-0001-Add-a-new-WAL-summarizer-process.patch
From 033e4e3eccc82a0935eefc9a4f7fef28d0a7af40 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v16 1/4] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summarize_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1398 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 33 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3743 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 44cada2b40..ee98585027 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4150,6 +4150,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 01e0484584..421a016ca1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3589,6 +3590,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3869,8 +3907,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3913,6 +3951,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5237,9 +5295,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6956,6 +7014,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7630,6 +7707,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL, false);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..5d4ebe3ebe 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c',
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
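+ *
+ * For example (hypothetical LSNs), summaries covering 0/100..0/200 and
+ * 0/180..0/300 together prove the range 0/100..0/300 complete, whereas
+ * 0/100..0/200 followed by 0/220..0/300 leaves a gap, so the sweep
+ * stops at 0/200 and reports that LSN via *missing_lsn.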
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
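+ *
+ * Example usage (hypothetical output values):
+ *
+ *     SELECT * FROM pg_available_wal_summaries();
+ *
+ * Each row reports one summary file's TLI, start LSN, and end LSN.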
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
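+ *
+ * Example usage (hypothetical argument values):
+ *
+ *     SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/1000128');
+ *
+ * Each output row identifies a relation fork and a block number, with the
+ * final boolean column distinguishing a limit block (the truncation point)
+ * from an ordinarily modified block.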
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index bae6f68c40..5f244216a6 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -158,6 +162,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 651b85ea74..b163e89cbb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -113,6 +113,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -250,6 +251,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -441,6 +443,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -564,6 +567,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -933,6 +937,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1835,6 +1842,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2659,6 +2669,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3012,6 +3024,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3130,6 +3143,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3525,6 +3552,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3675,6 +3708,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3701,6 +3736,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3798,6 +3834,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4019,6 +4056,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5326,6 +5365,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5462,6 +5505,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..7c840c36b3
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1398 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the last LSN and TLI at
+ * which the next summary file will start. Normally, these are the LSN and
+ * TLI at which the last file ended; in such case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
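+ *
+ * wal_summary_keep_time is expressed in minutes; the default is ten days,
+ * and a value of 0 disables automatic summary removal.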
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory, and ask
+ * GetOldestUnsummarizedLSN to reset pending_lsn to summarized_lsn. We
+ * might be recovering from an error, and if so, pending_lsn might have
+ * advanced past summarized_lsn, but any WAL we read previously has been
+ * lost and will need to be reread.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact, true);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if it's
+ * just the beginning of a WAL segment.
+ *
+ * If reset_pending_lsn is true, resets the pending_lsn in shared memory to
+ * be equal to the summarized_lsn.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
+ bool reset_pending_lsn)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Unless we need to reset the pending_lsn, we initially acquire the lock
+ * in shared mode and try to fetch the required information. If we acquire
+ * in shared mode and find that the data structure hasn't been
+ * initialized, we reacquire the lock in exclusive mode so that we can
+ * initialize it. However, if someone else does that first before we get
+ * the lock, then we can just return the requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ if (reset_pending_lsn)
+ WalSummarizerCtl->pending_lsn =
+ WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the requested information to the caller. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ *
+ * Either way, *pending_lsn is set to the value taken from WalSummarizerCtl.
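+ *
+ * For example (hypothetical values), a caller needing summaries through
+ * 1/0 might pass lsn = 1/0 and timeout = 10000; a return value below 1/0
+ * then means the ten-second wait timed out, and *pending_lsn shows how
+ * far the summarizer has read so far.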
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ *pending_lsn = WalSummarizerCtl->pending_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
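+ *
+ * For example (hypothetical LSNs), if a record ends exactly at the page
+ * boundary 0/2000000, xlogreader reports 0/2000000 as its end LSN even
+ * though the next record actually begins after the page header, e.g. at
+ * 0/2000018. A summary ending at 0/2000000 and a successor starting there
+ * are therefore still contiguous; only a page header lies between the
+ * nominal start and the first record actually summarized.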
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
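+ *
+ * For example (hypothetical numbers), truncating a relation to 100
+ * blocks sets the limit block for the affected forks to 100; there is
+ * no need to track modifications at or beyond that point.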
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private_data->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
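+/*
+ * An illustrative trace of the adjustment logic above (the starting values
+ * are made up for the example; only the rules from the code apply):
+ *
+ *   0 pages read,   sleep_quanta 4 -> 8   (doubled, capped at the maximum)
+ *   1 page read,    sleep_quanta 8 -> 8   (a single page is not a trend)
+ *   3 pages read,   sleep_quanta 8 -> 5   (reduced by the pages read)
+ *   100 pages read, sleep_quanta 5 -> 1   (clamped at the minimum)
+ */
+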
+/*
+ * Remove WAL summary files once the WAL they summarize no longer exists
+ * and their modification time has passed the cutoff implied by
+ * wal_summary_keep_time. We attempt this at most once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here. wal_summary_keep_time is in minutes, hence
+ * the conversion to seconds.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the summary
+ * file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index d99ecdd4d8..0dd9b98b3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -306,7 +306,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -328,6 +329,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 819936ec02..5c9b6f991e 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -305,6 +305,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f7c9882f7c..9f59440526 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -703,6 +704,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1786,6 +1789,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3200,6 +3213,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf9f283cfe..b2809c711a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
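+
+/*
+ * Worked example of the arithmetic above (nothing here is configurable; it
+ * all follows from the constants): BLOCKS_PER_CHUNK is 65536 and each uint16
+ * entry covers 16 blocks' worth of bitmap space, so MAX_ENTRIES_PER_CHUNK is
+ * 4096. An offset array that has grown to 4096 uint16 entries occupies 8 kB,
+ * exactly the size of a bitmap with one bit per block in the chunk, so the
+ * switch to bitmap format never costs space.
+ */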
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
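+
+/*
+ * The resulting file layout, as produced by WriteBlockRefTable and
+ * BlockRefTableFileTerminate below, is therefore:
+ *
+ *   uint32 magic number (BLOCKREFTABLE_MAGIC)
+ *   for each relation fork, sorted by tablespace, database, relfilenumber,
+ *   and fork:
+ *     BlockRefTableSerializedEntry (with trailing all-zero chunks trimmed
+ *       from nchunks)
+ *     uint16 chunk_usage[nchunks]
+ *     for each chunk with nonzero usage: chunk_usage[i] uint16 values
+ *   all-zeroes BlockRefTableSerializedEntry, as a sentinel
+ *   pg_crc32c covering everything that precedes it
+ */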
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
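+
+/*
+ * A minimal usage sketch (variable declarations assumed): to fetch the
+ * modified blocks below the limit block for one relation fork, a caller
+ * could write
+ *
+ *   entry = BlockRefTableGetEntry(brtab, &rlocator, forknum, &limit_block);
+ *   if (entry != NULL)
+ *       n = BlockRefTableEntryGetBlocks(entry, 0, limit_block,
+ *                                       blocks, lengthof(blocks));
+ *
+ * calling BlockRefTableEntryGetBlocks again with a higher start_blkno if
+ * the output array fills up.
+ */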
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
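+
+/*
+ * Growth trace for a single chunk, following the code above: the first
+ * modified block allocates a 16-entry offset array; the array doubles as
+ * needed (16, 32, ..., up to 4096 entries); when the 4096th distinct offset
+ * arrives, the chunk is converted to an 8 kB bitmap and never changes
+ * representation again.
+ */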
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 77e8b13764..916c8ec8d0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12099,4 +12099,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
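+/*
+ * ReadWalSummary and WriteWalSummary in backup/walsummary.h implement this
+ * interface for the backend. A minimal frontend-style write callback might
+ * look like this (a sketch; 'fd' and the error handling are assumptions of
+ * the example, not part of this API):
+ *
+ *   static int
+ *   my_write_callback(void *callback_arg, void *data, int length)
+ *   {
+ *       int fd = *(int *) callback_arg;
+ *
+ *       if (write(fd, data, length) != length)
+ *           pg_fatal("could not write block reference table: %m");
+ *       return length;
+ *   }
+ */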
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
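+
+/*
+ * Typical read loop (a sketch; the callbacks and the blocks array are
+ * assumptions of the example):
+ *
+ *   reader = CreateBlockRefTableReader(read_cb, &io, filename,
+ *                                      error_cb, NULL);
+ *   while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *                                          &limit_block))
+ *   {
+ *       while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
+ *                                                lengthof(blocks))) > 0)
+ *           ... process blocks[0 .. n-1] ...
+ *   }
+ *   DestroyBlockRefTableReader(reader);
+ */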
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * database, then tablespace, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
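To make the above concrete, here's a minimal sketch of how a caller
might drive this API. This isn't code from the patch, just an
illustration assuming a backend context (elog, write(2)); the
fd-based callback shows the io_callback_fn contract described above,
namely that short writes get retried and errors don't return:

static int
write_cb(void *callback_arg, void *data, int length)
{
    int         fd = *(int *) callback_arg;
    int         done = 0;

    /* Per the contract, retry short writes; on error, don't return. */
    while (done < length)
    {
        int         rc = write(fd, (char *) data + done, length - done);

        if (rc < 0)
            elog(ERROR, "could not write block reference table: %m");
        done += rc;
    }
    return length;
}

static void
blkreftable_example(int fd, RelFileLocator rlocator)
{
    BlockRefTable *brtab = CreateEmptyBlockRefTable();

    /* Fork truncated to 100 blocks; afterward, block 7 was modified. */
    BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 100);
    BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 7);

    /* Serialize; blocks >= the limit block are implicitly modified. */
    WriteBlockRefTable(brtab, write_cb, &fd);
}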
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1043a4d782..74bc2f97cb 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,6 +336,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -442,6 +443,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -454,6 +456,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..180d3f34b9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact,
+ bool reset_pending_lsn);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
+ XLogRecPtr *pending_lsn);
+
+#endif
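And here's roughly how I'd expect a caller to wait for summarization
to catch up to a given LSN (a hedged sketch, not from the patch; I'm
assuming the timeout is in milliseconds and that the return value is
how far summarization has actually gotten):

    XLogRecPtr  pending_lsn;
    XLogRecPtr  summarized;

    /* Wait up to 60 seconds for summaries to reach backup_start_lsn. */
    summarized = WaitForWalSummarization(backup_start_lsn, 60000,
                                         &pending_lsn);
    if (summarized < backup_start_lsn)
        ereport(ERROR,
                (errmsg("WAL summarization is not progressing"),
                 errdetail("Summarization reached %X/%X; pending LSN is %X/%X.",
                           LSN_FORMAT_ARGS(summarized),
                           LSN_FORMAT_ARGS(pending_lsn))));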
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4b25961249..e87fd25d64 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ba41149b88..9390049314 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4012,3 +4012,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.39.3 (Apple Git-145)
Attachment: v16-0004-Test-patch-Enable-summarize_wal-by-default.patch (application/octet-stream)
From a5c00fe73b91d35aa4902ac1fc93acc3aac751ea Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v16 4/4] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal=on and wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b163e89cbb..51dc517710 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -937,9 +937,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9fa155349e..71025b43b7 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;	/* ten days, in minutes */
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9f59440526..f249a9fad5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1795,7 +1795,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.39.3 (Apple Git-145)
Attachment: v16-0002-Add-support-for-incremental-backup.patch (application/octet-stream)
From 2893716fa325f249d2a75469bcbe7df97dd204cc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v16 2/4] Add support for incremental backup.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
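If you want to drive this from the replication protocol yourself
rather than through pg_basebackup, the flow looks roughly like the
libpq sketch below. This is illustrative only: I'm assuming here that
UPLOAD_MANIFEST switches the connection into COPY IN mode for the
manifest transfer, and error handling plus consumption of the
resulting archives is omitted. As usual for BASE_BACKUP, it needs a
replication connection.

static void
request_incremental_backup(PGconn *conn, const char *manifest,
                           int manifest_len)
{
    PGresult   *res;

    /* Ship the prior backup's manifest to the server. */
    res = PQexec(conn, "UPLOAD_MANIFEST");
    if (PQresultStatus(res) == PGRES_COPY_IN)
    {
        PQputCopyData(conn, manifest, manifest_len);
        PQputCopyEnd(conn, NULL);
        PQclear(PQgetResult(conn));     /* collect command completion */
    }
    PQclear(res);

    /* Now take the backup itself, relative to the uploaded manifest. */
    res = PQexec(conn, "BASE_BACKUP ( INCREMENTAL )");
    /* ... consume the resulting COPY data as pg_basebackup would ... */
    PQclear(res);
}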
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar, Andres Freund, and Álvaro Herrera
for design discussion and reviews, and to Jakub Wartak for incredibly
helpful and extensive testing.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/protocol.sgml | 24 +
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 240 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 319 +++-
src/backend/backup/basebackup_incremental.c | 1003 +++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 112 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1284 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
49 files changed, 5834 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest of an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and just take full backups, which are simpler
+ to manage. For a large database all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee98585027..b5624ca884 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4153,13 +4153,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index af3f016f74..9a66918171 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2599,6 +2599,19 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</listitem>
</varlistentry>
+ <varlistentry id="protocol-replication-upload-manifest">
+ <term>
+ <literal>UPLOAD_MANIFEST</literal>
+ <indexterm><primary>UPLOAD_MANIFEST</primary></indexterm>
+ </term>
+ <listitem>
+ <para>
+ Uploads a backup manifest in preparation for taking an incremental
+ backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP">
<term><literal>BASE_BACKUP</literal> [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
@@ -2838,6 +2851,17 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>INCREMENTAL</literal></term>
+ <listitem>
+ <para>
+ Requests an incremental backup. The
+ <literal>UPLOAD_MANIFEST</literal> command must be executed
+ before running a base backup with this option.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..e1729671a5
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,240 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-n</option></term>
+ <term><option>--dry-run</option></term>
+ <listitem>
+ <para>
+ The <option>-n</option>/<option>--dry-run</option> option instructs
+ <command>pg_combinebackup</command> to figure out what would be done
+ without actually creating the target directory or any output files.
+ It is particularly useful in combination with <option>--debug</option>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method=<replaceable class="parameter">method</replaceable></option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and
+ exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command
+ line arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a2c8fa3981..6f4f81f992 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..5ee9628422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -33,6 +35,7 @@
#include "pgtar.h"
#include "port.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
#include "storage/bufpage.h"
@@ -64,6 +67,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +80,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +113,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +232,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +283,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +306,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +357,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +365,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +627,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +703,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +782,20 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ if (opt->incremental && !summarize_wal)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +988,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1012,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1057,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1135,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1169,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1189,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1198,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1240,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1392,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1467,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1542,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1558,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1571,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1581,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1614,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1901,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
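For anyone trying to follow the header-emitting code in sendFile()
above, the resulting incremental file layout works out to roughly
this (my own sketch, assuming 4-byte unsigned ints; the patch doesn't
declare a struct in this form):

typedef struct
{
    uint32      magic;          /* INCREMENTAL_MAGIC */
    uint32      num_blocks;     /* how many blocks this file contains */
    uint32      truncation_block_length;   /* for handling truncations */
    /* then: BlockNumber blocks[num_blocks], segment-relative numbers */
    /* then: num_blocks * BLCKSZ bytes of block contents, in that order */
} incremental_file_header_sketch;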
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..1e5a5ac33a
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,1003 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
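+
+/*
+ * Sketch of the expected lifecycle of one of these objects: the first three
+ * calls happen while handling UPLOAD_MANIFEST in walsender.c, and the last
+ * happens later, from the basebackup machinery, once the backup starts.
+ *
+ * ib = CreateIncrementalBackupInfo(mcxt);
+ * AppendIncrementalManifestData(ib, data, len);    (repeated per chunk)
+ * FinalizeIncrementalManifest(ib);
+ * PrepareForIncrementalBackup(ib, backup_state);
+ */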
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+ XLogRecPtr pending_lsn;
+ XLogRecPtr prior_pending_lsn = InvalidXLogRecPtr;
+ int deadcycles = 0;
+ TimestampTz initial_time,
+ current_time;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ initial_time = current_time = GetCurrentTimestamp();
+ while (1)
+ {
+ long timeout_in_ms = 10000;
+ unsigned elapsed_seconds;
+
+ /*
+ * Align the wait time to prevent drift. This doesn't really matter,
+ * but we'd like the warnings about how long we've been waiting to say
+ * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
+ * drifting to something that is not a multiple of ten.
+ */
+ timeout_in_ms -=
+ TimestampDifferenceMilliseconds(initial_time, current_time) %
+ timeout_in_ms;
+
+ /* Wait for up to 10 seconds, using the aligned timeout. */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
+ timeout_in_ms, &pending_lsn);
+
+ /* If WAL summarization has progressed sufficiently, stop waiting. */
+ if (summarized_lsn >= backup_state->startpoint)
+ break;
+
+ /*
+ * Keep track of the number of cycles during which there has been no
+ * progression of pending_lsn. If pending_lsn is not advancing, that
+ * means that not only are no new files appearing on disk, but we're
+ * not even incorporating new records into the in-memory state.
+ */
+ if (pending_lsn > prior_pending_lsn)
+ {
+ prior_pending_lsn = pending_lsn;
+ deadcycles = 0;
+ }
+ else
+ ++deadcycles;
+
+ /*
+ * If we've managed to wait for an entire minute without the WAL
+ * summarizer absorbing a single WAL record, error out; probably
+ * something is wrong.
+ *
+ * We could consider also erroring out if the summarizer is taking too
+ * long to catch up, but it's not clear what rate of progress would be
+ * acceptable and what would be too slow. So instead, we just try to
+ * error out in the case where there's no progress at all. That seems
+ * likely to catch a reasonable number of the things that can go wrong
+ * in practice (e.g. the summarizer process is completely hung, say
+ * because somebody hooked up a debugger to it or something) without
+ * giving up too quickly when the system is just slow.
+ */
+ if (deadcycles >= 6)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summarization is not progressing"),
+ errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+
+ /*
+ * Otherwise, just let the user know what's happening.
+ */
+ current_time = GetCurrentTimestamp();
+ elapsed_seconds =
+ TimestampDifferenceMilliseconds(initial_time, current_time) / 1000;
+ ereport(WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("still waiting for WAL summarization through %X/%X after %d seconds",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ elapsed_seconds),
+ errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+ }
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
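+
+/*
+ * For example, with a hypothetical relation stored at base/16384/16385,
+ * segment 2 of the main fork would be sent incrementally as
+ * base/16384/INCREMENTAL.16385.2, and segment 0 as
+ * base/16384/INCREMENTAL.16385.
+ */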
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * If this file was not part of the prior backup, back it up fully.
+ *
+ * If this file was created after the prior backup and before the start of
+ * the current backup, then the WAL summary information will tell us to
+ * back up the whole file. However, if this file was created after the
+ * start of the current backup, then the WAL summary won't know anything
+ * about it. Without this logic, we would erroneously conclude that it was
+ * OK to send it incrementally.
+ *
+ * Note that the file could have existed at the time of the prior backup,
+ * gotten deleted, and then a new file with the same name could have been
+ * created. In that case, this logic won't prevent the file from being
+ * backed up incrementally. But, if the deletion happened before the start
+ * of the current backup, the limit block will be 0, inducing a full
+ * backup. If the deletion happened after the start of the current backup,
+ * reconstruction will erroneously combine blocks from the current
+ * lifespan of the file with blocks from the previous lifespan -- but in
+ * this type of case, WAL replay to reach backup consistency should remove
+ * and recreate the file anyway, so the initial bogus contents should not
+ * matter.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose them to relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
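+
+/*
+ * A sketch of the intended call pattern; the real caller is the basebackup
+ * machinery, which may differ in detail:
+ *
+ * BlockNumber relative_block_numbers[RELSEG_SIZE];
+ * unsigned    num_blocks;
+ * unsigned    truncation_block_length;
+ * size_t      bytes_to_send;
+ *
+ * if (GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber, forknum,
+ *                         segno, size, &num_blocks, relative_block_numbers,
+ *                         &truncation_block_length)
+ *     == BACK_UP_FILE_INCREMENTALLY)
+ *     bytes_to_send = GetIncrementalFileSize(num_blocks);
+ */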
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
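+
+/*
+ * Worked example: with the default BLCKSZ of 8192, an incremental file
+ * containing 3 blocks occupies 3 * sizeof(uint32) = 12 bytes of header,
+ * 3 * sizeof(BlockNumber) = 12 bytes of block numbers, and 3 * 8192 = 24576
+ * bytes of block contents, or 24600 bytes in total.
+ */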
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 5d4ebe3ebe..2a6a2dc7c0 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..dbcda32554 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
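+
+/*
+ * From the client's point of view, the exchange implemented here looks
+ * roughly like this (see BaseBackup() in pg_basebackup.c for the actual
+ * client-side code):
+ *
+ * => UPLOAD_MANIFEST
+ * <= CopyInResponse
+ * => CopyData (one or more manifest chunks)
+ * => CopyDone
+ * <= CommandComplete, ReadyForQuery
+ */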
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in COPY mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0e0ac22bdd..706140eb9f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -32,6 +32,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -140,6 +141,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -337,6 +339,7 @@ CreateOrAttachShmemStructs(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..5795b91261 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,76 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* Reject if server is too old. */
+ if (serverVersion < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ pg_fatal("server does not support incremental backup");
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* Close the manifest file; we are done reading it. */
+ close(fd);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1997,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2353,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2391,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2416,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2451,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2867,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
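+ /*
+ * Temporarily NUL-terminate the line at 'e' so that sscanf() cannot
+ * consume bytes beyond the end of the line; the saved byte is restored
+ * below.
+ */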
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..40a55e3087
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
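+ /*
+ * If no platform-specific primitive was selected above, we simply fall
+ * through to the block-by-block copy below.
+ */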
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
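+ /* 50 blocks per read: 400kB with the default 8kB BLCKSZ */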
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..ad32323c9c
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..85d3f4e5de
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1284 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
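+
+ /*
+ * For example, an incremental backup might contain a file named
+ * "INCREMENTAL.16384" where a full backup would contain "16384"; the
+ * prefix is stripped when the reconstructed file is written out.
+ */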
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+static cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "dnNPo:T:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
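+
+ /*
+ * Note that prior_backup_dirs actually points at all n_backups entries;
+ * the final backup's directory is the last element, which is why it can
+ * also be passed to load_backup_manifests() below.
+ */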
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, prior_backup_dirs);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ {
+ mwriter = create_manifest_writer(opt.output);
+
+ /*
+ * Verify that we have a backup manifest for the final backup; else we
+ * won't have the WAL ranges for the resulting manifest.
+ */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ }
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
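+ /*
+ * For example, the argument "/srv/old\=dir=/srv/new" yields old_dir
+ * "/srv/old=dir" and new_dir "/srv/new".
+ */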
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the directory to the list of directories to be cleaned up at process
+ * exit, whether we created it here or it already existed (empty).
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If processing is a user-defined tablespace, the tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format (e.g., if PG_VERSION contains "14\n", this function
+ * will return 140000).
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
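+ * They are updated as blocks are read during reconstruction, and are later
+ * used for debug output and dry-run sanity checks.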
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
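+ *
+ * For example, if block 3 of the output file is to be read from offset
+ * 24576 of some source file, then sourcemap[3] points to that file's rfile
+ * and offsetmap[3] == 24576.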
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without
+ * ever taking any action on those blocks that would have
+ * generated WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ if (pg_checksum_init(&checksum_ctx, checksum_type) < 0)
+ pg_fatal("could not initialize checksum of file \"%s\"",
+ output_filename);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
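+ *
+ * For example, if the truncation_block_length is 4 but the incremental file
+ * also stores block 6, the reconstructed file must be 7 blocks long.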
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
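+ *
+ * The header consists of the magic number, the block count, the truncation
+ * block length, and then the array of relative block numbers, in that
+ * order.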
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
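+ *
+ * Reaching EOF or a short read is reported as a fatal error.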
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting for apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting for apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+SELECT string_agg(a::text, ':'), string_agg(b, ':') FROM mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+ 'found expected rows after combining backups across timelines');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
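+ /*
+ * The path is not valid UTF-8, so emit it hex-encoded under
+ * "Encoded-Path" rather than as an ordinary JSON string.
+ */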
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
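+ /* Flush periodically so the in-memory buffer stays bounded. */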
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
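+ *
+ * Backslashes, double quotes, and control characters are escaped as
+ * required by RFC 8259.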
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
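+ *
+ * The caller must ensure that the destination buffer has room for 2 * len
+ * characters; no terminating NUL is written.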
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
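+ /*
+ * Summary file names consist of an 8-character timeline ID and two
+ * 16-character LSNs (start and end), all in hex, followed by ".summary".
+ */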
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
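+/* Magic number stored at the start of each incremental file. */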
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..46cb2a6550 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9390049314..e37ef9aa76 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4023,3 +4023,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.39.3 (Apple Git-145)
Hi Robert,
On Tue, Dec 19, 2023 at 9:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
mismatch (of course they do differ after initdb), but I get this
instead:
$ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
--incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING: aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR: timeline
2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"
Also in the manifest I don't see DBID?
Maybe it's a nuisance and all I'm trying to say is that if an
automated cronjob with pg_basebackup --incremental hits a freshly
upgraded cluster, that error message without errhint() is going to
scare some Junior DBAs.
Yeah. I think we should add the system identifier to the manifest, but
I think that should be left for a future project, as I don't think the
lack of it is a good reason to stop all progress here. When we have
that, we can give more reliable error messages about system mismatches
at an earlier stage. Unfortunately, I don't think that the timeline
messages you're seeing here are going to apply in every case: suppose
you have two unrelated servers that are both on timeline 1. I think
you could use a base backup from one of those servers and use it as
the basis for the incremental from the other, and I think that if you
did it right you might fail to hit any sanity check that would block
that. pg_combinebackup will realize there's a problem, because it has
the whole cluster to work with, not just the manifest, and will notice
the mismatching system identifiers, but that's kind of late to find
out that you made a big mistake. However, right now, it's the best we
can do.
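Until the manifest carries the system identifier, a cautious operator can
compare identifiers by hand before taking an incremental backup. A minimal
sketch, with a hypothetical data directory path:

$ pg_controldata /var/lib/postgres/17/data | grep 'system identifier'
Database system identifier:           7315070207972093366

If the value differs from the one noted when the full backup was taken, the
backup chain does not belong to this cluster.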
OK, understood.
The incrementals are being generated, but just for the first (0)
segment of the relation?
I committed the first two patches from the series I posted yesterday.
The first should fix this, and the second relocates parse_manifest.c.
That patch hasn't changed in a while and seems unlikely to attract
major objections. There's no real reason to commit it until we're
ready to move forward with the main patches, but I think we're very
close to that now, so I did.
Here's a rebase for cfbot.
the v15 patchset (posted yesterday) test results are GOOD:
1. make check-world - GOOD
2. cfbot was GOOD
3. the devel/master bug present in
parse_filename_for_nontemp_relation() seems to be gone (in local
testing)
4. some further tests:
test_across_wallevelminimal.sh - GOOD
test_incr_after_timelineincrease.sh - GOOD
test_incr_on_standby_after_promote.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_pending_2pc.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_repro_assert.sh
test_standby_incr_just_backup.sh - GOOD
test_stuck_walsum.sh - GOOD
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD
test_full_stby__incr_stby__restore_on_stby.sh - GOOD
5. the more real-world pgbench test with localized segment writes
using `\set aid random_exponential...` [1] indicates much greater
efficiency in terms of backup space use now; du -sm shows:
210229 /backups/backups/full
250 /backups/backups/incr.1
255 /backups/backups/incr.2
[..]
348 /backups/backups/incr.13
408 /backups/backups/incr.14 // latest (20th of Dec at 10:40)
6673 /backups/archive/
The DB size, as reported by \l+, was 205GB.
That pgbench was running for ~27h (19th Dec 08:39 -> 20th Dec 11:30)
at a slow 100 TPS (-R), so no insane amounts of WAL.
Time to reconstruct the 14 chained incremental backups was 45 minutes
(pg_combinebackup -o /var/lib/postgres/17/data /backups/backups/full
/backups/backups/incr.1 (..) /backups/backups/incr.14).
DB after recovering was OK and working fine.
-J.
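For reference, the skewed-access workload from point 5 can be approximated
with a custom pgbench script along these lines; the exact parameters were
elided above, so these values are hypothetical:

\set aid random_exponential(1, 100000 * :scale, 5.0)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
END;

run, e.g., as: pgbench -n -R 100 -T 97200 -f skewed.sql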
On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
the v15 patchset (posted yesterday) test results are GOOD:
All right. I committed the main two patches, dropped the
for-testing-only patch, and added a simple test to the remaining
pg_walsummary patch. That needs more work, but here's what I have as
of now.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v17-0001-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From b1ef3268b441d7661f5277e4aa89468d957a9f5d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 20 Dec 2023 15:52:59 -0500
Subject: [PATCH v17] Add new pg_walsummary tool.
This can dump the contents of the WAL summary files found in
pg_wal/summaries. Normally, this shouldn't really be something anyone
needs to do, but it may be needed for debugging problems with
incremental backup, or could possibly be used in some useful way by
external tools.
XXX. Needs more tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 48 +++++
src/bin/pg_walsummary/meson.build | 29 +++
src/bin/pg_walsummary/nls.mk | 6 +
src/bin/pg_walsummary/pg_walsummary.c | 280 ++++++++++++++++++++++++++
src/bin/pg_walsummary/t/001_basic.pl | 19 ++
src/tools/pgindent/typedefs.list | 2 +
12 files changed, 511 insertions(+)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/nls.mk
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
create mode 100644 src/bin/pg_walsummary/t/001_basic.pl
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ truncated within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..2c24bc6db5
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,48 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..25cd56cda8
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,29 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_walsummary/nls.mk b/src/bin/pg_walsummary/nls.mk
new file mode 100644
index 0000000000..f411dcfe9e
--- /dev/null
+++ b/src/bin/pg_walsummary/nls.mk
@@ -0,0 +1,6 @@
+# src/bin/pg_walsummary/nls.mk
+CATALOG_NAME = pg_walsummary
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ pg_walsummary.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "f:iqw:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/bin/pg_walsummary/t/001_basic.pl b/src/bin/pg_walsummary/t/001_basic.pl
new file mode 100644
index 0000000000..10a232a150
--- /dev/null
+++ b/src/bin/pg_walsummary/t/001_basic.pl
@@ -0,0 +1,19 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_walsummary');
+program_version_ok('pg_walsummary');
+program_options_handling_ok('pg_walsummary');
+
+command_fails_like(
+ ['pg_walsummary'],
+ qr/no input files specified/,
+ 'input files must be specified');
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e37ef9aa76..86e0a86503 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4035,3 +4035,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
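For orientation, a usage sketch for the tool above; the summary file name
and OIDs are hypothetical, and the output lines follow the printf formats
in pg_walsummary.c:

$ pg_walsummary $PGDATA/pg_wal/summaries/00000001000000000100002800000000010000F8.summary
TS 1663, DB 5, REL 16384, FORK main: limit 0
TS 1663, DB 5, REL 16384, FORK main: blocks 0..127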
Hello Robert,
20.12.2023 23:56, Robert Haas wrote:
On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
the v15 patchset (posted yesterday) test results are GOOD:
All right. I committed the main two patches, dropped the
for-testing-only patch, and added a simple test to the remaining
pg_walsummary patch. That needs more work, but here's what I have as
of now.
I've found several typos/inconsistencies introduced with 174c48050 and
dc2123400. Maybe you would want to fix them, while on it?:
s/arguent/argument/;
s/BlkRefTableEntry/BlockRefTableEntry/;
s/BlockRefTablEntry/BlockRefTableEntry/;
s/Caonicalize/Canonicalize/;
s/Checksum_Algorithm/Checksum-Algorithm/;
s/corresonding/corresponding/;
s/differenly/differently/;
s/excessing/excessive/;
s/ exta / extra /;
s/hexademical/hexadecimal/;
s/initally/initially/;
s/MAXGPATH/MAXPGPATH/;
s/overrreacting/overreacting/;
s/old_meanifest_file/old_manifest_file/;
s/pg_cominebackup/pg_combinebackup/;
s/pg_tblpc/pg_tblspc/;
s/pointrs/pointers/;
s/Recieve/Receive/;
s/recieved/received/;
s/ recod / record /;
s/ recods / records /;
s/substntially/substantially/;
s/sumamry/summary/;
s/summry/summary/;
s/synchronizaton/synchronization/;
s/sytem/system/;
s/withot/without/;
s/Woops/Whoops/;
s/xlograder/xlogreader/;
Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above redo_pointer_at_last_summary_removal declaration, but
perhaps it should say about removing summaries instead?
Best regards,
Alexander
On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
I've found several typos/inconsistencies introduced with 174c48050 and
dc2123400. Maybe you would want to fix them, while on it?
That's an impressively long list of mistakes in something I thought
I'd been careful about. Sigh.
I don't suppose you could provide these corrections in the form of a
patch? I don't really want to run these sed commands across the entire
tree and then try to figure out what's what...
Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above redo_pointer_at_last_summary_removal declaration, but
perhaps it should say about removing summaries instead?
Wow, yeah. Thanks, will fix.
--
Robert Haas
EDB: http://www.enterprisedb.com
21.12.2023 15:07, Robert Haas wrote:
On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
I've found several typos/inconsistencies introduced with 174c48050 and
dc2123400. Maybe you would want to fix them, while on it?
That's an impressively long list of mistakes in something I thought
I'd been careful about. Sigh.
I don't suppose you could provide these corrections in the form of a
patch? I don't really want to run these sed commands across the entire
tree and then try to figure out what's what...
Please look at the attached patch; it corrects all 29 items ("recods"
fixed in two places), but maybe you find some substitutions wrong...
I've also observed that those commits introduced new warnings:
$ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8
reconstruct.c: In function ‘read_bytes’:
reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
511 | if (rb < 0)
| ^
reconstruct.c: In function ‘write_reconstructed_file’:
reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
650 | if (rb < 0)
| ^
reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
662 | if (wb < 0)
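A standalone reduction of that warning (a hypothetical example, not project
code): read() returns ssize_t, so storing its result in an unsigned
variable makes the error check unreachable.

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	unsigned	rb = read(-1, NULL, 0); /* read() fails and returns -1 */

	if (rb < 0)					/* always false: rb is unsigned */
		puts("error path (unreachable)");
	else
		printf("rb = %u\n", rb);	/* prints 4294967295: -1, wrapped */
	return 0;
}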
There are also two deadcode.DeadStores complaints from clang. First one is
about:
/*
* Align the wait time to prevent drift. This doesn't really matter,
* but we'd like the warnings about how long we've been waiting to say
* 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
* drifting to something that is not a multiple of ten.
*/
timeout_in_ms -=
TimestampDifferenceMilliseconds(current_time, initial_time) %
timeout_in_ms;
It looks like this timeout is really not used.
And the minor one (similar to many existing, maybe doesn't deserve fixing):
walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read [deadcode.DeadStores]
summary_end_lsn = private_data->read_upto;
^ ~~~~~~~~~~~~~~~~~~~~~~~
Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above redo_pointer_at_last_summary_removal declaration, but
perhaps it should say about removing summaries instead?
Wow, yeah. Thanks, will fix.
Thank you for paying attention to it!
Best regards,
Alexander
Attachments:
fix-typos.patch (text/x-patch; charset=UTF-8)
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 7c183a5cfd..e411ddbf45 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -213,7 +213,7 @@ PostgreSQL documentation
<varlistentry>
<term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
- <term><option>--incremental=<replaceable class="parameter">old_meanifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
<listitem>
<para>
Performs an <link linkend="backup-incremental-backup">incremental
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
index e1cb31607e..8a0a600c2b 100644
--- a/doc/src/sgml/ref/pg_combinebackup.sgml
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -83,7 +83,7 @@ PostgreSQL documentation
<listitem>
<para>
The <option>-n</option>/<option>--dry-run</option> option instructs
- <command>pg_cominebackup</command> to figure out what would be done
+ <command>pg_combinebackup</command> to figure out what would be done
without actually creating the target directory or any output files.
It is particularly useful in combination with <option>--debug</option>.
</para>
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 1e5a5ac33a..42bbe564e2 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -158,7 +158,7 @@ CreateIncrementalBackupInfo(MemoryContext mcxt)
/*
* Before taking an incremental backup, the caller must supply the backup
- * manifest from a prior backup. Each chunk of manifest data recieved
+ * manifest from a prior backup. Each chunk of manifest data received
* from the client should be passed to this function.
*/
void
@@ -462,7 +462,7 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
++deadcycles;
/*
- * If we've managed to wait for an entire minute withot the WAL
+ * If we've managed to wait for an entire minute without the WAL
* summarizer absorbing a single WAL record, error out; probably
* something is wrong.
*
@@ -473,7 +473,7 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
* likely to catch a reasonable number of the things that can go wrong
* in practice (e.g. the summarizer process is completely hung, say
* because somebody hooked up a debugger to it or something) without
- * giving up too quickly when the sytem is just slow.
+ * giving up too quickly when the system is just slow.
*/
if (deadcycles >= 6)
ereport(ERROR,
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
index a1f69ad4ba..f96491534d 100644
--- a/src/backend/backup/walsummaryfuncs.c
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -92,7 +92,7 @@ pg_wal_summary_contents(PG_FUNCTION_ARGS)
errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid timeline %lld", (long long) raw_tli));
- /* Prepare to read the specified WAL summry file. */
+ /* Prepare to read the specified WAL summary file. */
ws.tli = (TimeLineID) raw_tli;
ws.start_lsn = PG_GETARG_LSN(1);
ws.end_lsn = PG_GETARG_LSN(2);
@@ -143,7 +143,7 @@ pg_wal_summary_contents(PG_FUNCTION_ARGS)
}
/*
- * If the limit block is not InvalidBlockNumber, emit an exta row
+ * If the limit block is not InvalidBlockNumber, emit an extra row
* with that block number and limit_block = true.
*
* There is no point in doing this when the limit_block is
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9fa155349e..071d2c0d58 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -87,7 +87,7 @@ typedef struct
XLogRecPtr pending_lsn;
/*
- * This field handles its own synchronizaton.
+ * This field handles its own synchronization.
*/
ConditionVariable summary_file_cv;
} WalSummarizerData;
@@ -117,7 +117,7 @@ static long sleep_quanta = 1;
/*
* The sleep time will always be a multiple of 200ms and will not exceed
* thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
- * to be substntially less than the maximum amount of time for which an
+ * to be substantially less than the maximum amount of time for which an
* incremental backup will wait for this process to catch up. Otherwise, an
* incremental backup might time out on an idle system just because we sleep
* for too long.
@@ -212,7 +212,7 @@ WalSummarizerMain(void)
/*
* Within this function, 'current_lsn' and 'current_tli' refer to the
* point from which the next WAL summary file should start. 'exact' is
- * true if 'current_lsn' is known to be the start of a WAL recod or WAL
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
* segment, and false if it might be in the middle of a record someplace.
*
* 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
@@ -297,7 +297,7 @@ WalSummarizerMain(void)
/*
* Sleep for 10 seconds before attempting to resume operations in
- * order to avoid excessing logging.
+ * order to avoid excessive logging.
*
* Many of the likely error conditions are things that will repeat
* every time. For example, if the WAL can't be read or the summary
@@ -449,7 +449,7 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
return InvalidXLogRecPtr;
/*
- * Unless we need to reset the pending_lsn, we initally acquire the lock
+ * Unless we need to reset the pending_lsn, we initially acquire the lock
* in shared mode and try to fetch the required information. If we acquire
* in shared mode and find that the data structure hasn't been
* initialized, we reacquire the lock in exclusive mode so that we can
@@ -699,7 +699,7 @@ HandleWalSummarizerInterrupts(void)
*
* 'start_lsn' is the point at which we should start summarizing. If this
* value comes from the end LSN of the previous record as returned by the
- * xlograder machinery, 'exact' should be true; otherwise, 'exact' should
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
* be false, and this function will search forward for the start of a valid
* WAL record.
*
@@ -872,7 +872,7 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
xlogreader->ReadRecPtr >= switch_lsn)
{
/*
- * Woops! We've read a record that *starts* after the switch LSN,
+ * Whoops! We've read a record that *starts* after the switch LSN,
* contrary to our goal of reading only until we hit the first
* record that ends at or after the switch LSN. Pretend we didn't
* read it after all by bailing out of this loop right here,
@@ -1061,7 +1061,7 @@ SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
}
/*
- * Special handling for WAL recods with RM_XACT_ID.
+ * Special handling for WAL records with RM_XACT_ID.
*/
static void
SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
@@ -1116,7 +1116,7 @@ SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
}
/*
- * Special handling for WAL recods with RM_XLOG_ID.
+ * Special handling for WAL records with RM_XLOG_ID.
*/
static bool
SummarizeXlogRecord(XLogReaderState *xlogreader)
@@ -1295,7 +1295,7 @@ summarizer_wait_for_wal(void)
* records to provoke a strong reaction. We choose to reduce the sleep
* time by 1 quantum for each page read beyond the first, which is a
* fairly arbitrary way of trying to be reactive without
- * overrreacting.
+ * overreacting.
*/
if (pages_read_since_last_sleep > sleep_quanta - 1)
sleep_quanta = 1;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index dbcda32554..d4aa9e1c96 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -706,7 +706,7 @@ UploadManifest(void)
pq_endmessage_reuse(&buf);
pq_flush();
- /* Recieve packets from client until done. */
+ /* Receive packets from client until done. */
while (HandleUploadManifestPacket(&buf, &offset, ib))
;
@@ -719,7 +719,7 @@ UploadManifest(void)
*
* We assume that MemoryContextDelete and MemoryContextSetParent won't
* fail, and thus we shouldn't end up bailing out of here in such a way as
- * to leave dangling pointrs.
+ * to leave dangling pointers.
*/
if (uploaded_manifest_mcxt != NULL)
MemoryContextDelete(uploaded_manifest_mcxt);
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 85d3f4e5de..b6ae6f2aef 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -454,7 +454,7 @@ check_backup_label_files(int n_backups, char **backup_dirs)
* The exact size limit that we impose here doesn't really matter --
* most of what's supposed to be in the file is fixed size and quite
* short. However, the length of the backup_label is limited (at least
- * by some parts of the code) to MAXGPATH, so include that value in
+ * by some parts of the code) to MAXPGPATH, so include that value in
* the maximum length that we tolerate.
*/
slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
@@ -1192,7 +1192,7 @@ scan_for_existing_tablespaces(char *pathname, cb_options *opt)
if (!is_absolute_path(link_target))
pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
- /* Caonicalize the link target. */
+ /* Canonicalize the link target. */
canonicalize_path(link_target);
/*
@@ -1222,7 +1222,7 @@ scan_for_existing_tablespaces(char *pathname, cb_options *opt)
* we just record the paths within the data directories.
*/
snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
- snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
de->d_name);
ts->in_place = true;
}
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
index 37de61ac06..4f3779274f 100644
--- a/src/bin/pg_combinebackup/t/004_manifest.pl
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -69,7 +69,7 @@ my $nocsum_manifest =
slurp_file($node->backup_dir . '/csum_none/backup_manifest');
my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
is($nocsum_count, 0,
- "Checksum_Algorithm is not mentioned in no-checksum manifest");
+ "Checksum-Algorithm is not mentioned in no-checksum manifest");
# OK, that's all.
done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
index 82160134d8..5cf36c2b05 100644
--- a/src/bin/pg_combinebackup/write_manifest.c
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -272,7 +272,7 @@ flush_manifest(manifest_writer *mwriter)
}
/*
- * Encode bytes using two hexademical digits for each one.
+ * Encode bytes using two hexadecimal digits for each one.
*/
static size_t
hex_encode(const uint8 *src, size_t len, char *dst)
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
index 21ee6f5968..ccbb4006bd 100644
--- a/src/common/blkreftable.c
+++ b/src/common/blkreftable.c
@@ -100,7 +100,7 @@ typedef uint16 *BlockRefTableChunk;
* 'chunk_size' is an array storing the allocated size of each chunk.
*
* 'chunk_usage' is an array storing the number of elements used in each
- * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresonding
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
* chunk is used as an array; else the corresponding chunk is used as a bitmap.
* When used as a bitmap, the least significant bit of the first array element
* is the status of the lowest-numbered block covered by this chunk.
@@ -567,7 +567,7 @@ WriteBlockRefTable(BlockRefTable *brtab,
* malformed. This is not used for I/O errors, which must be handled internally
* by read_callback.
*
- * 'error_callback_arg' is an opaque arguent to be passed to error_callback.
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
*/
BlockRefTableReader *
CreateBlockRefTableReader(io_callback_fn read_callback,
@@ -922,7 +922,7 @@ BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
/*
* Next, we need to discard any offsets within the chunk that would
- * contain the limit_block. We must handle this differenly depending on
+ * contain the limit_block. We must handle this differently depending on
* whether the chunk that would contain limit_block is a bitmap or an
* array of offsets.
*/
@@ -955,7 +955,7 @@ BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
}
/*
- * Mark a block in a given BlkRefTableEntry as known to have been modified.
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
*/
void
BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
@@ -1112,7 +1112,7 @@ BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
}
/*
- * Release memory for a BlockRefTablEntry that was created by
+ * Release memory for a BlockRefTableEntry that was created by
* CreateBlockRefTableEntry.
*/
void
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 916c8ec8d0..b8b26c263d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12109,7 +12109,7 @@
proargnames => '{tli,start_lsn,end_lsn}',
prosrc => 'pg_available_wal_summaries' },
{ oid => '8437',
- descr => 'contents of a WAL sumamry file',
+ descr => 'contents of a WAL summary file',
proname => 'pg_wal_summary_contents', prorows => '100',
proretset => 't', provolatile => 'v', proparallel => 's',
prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
On Thu, Dec 21, 2023 at 10:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
Please look at the attached patch; it corrects all 29 items ("recods"
fixed in two places), but maybe you find some substitutions wrong...
Thanks, committed with a few additions.
I've also observed that those commits introduced new warnings:
$ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8
reconstruct.c: In function ‘read_bytes’:
reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
511 | if (rb < 0)
| ^
reconstruct.c: In function ‘write_reconstructed_file’:
reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
650 | if (rb < 0)
| ^
reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
662 | if (wb < 0)
Oops. I think the variables should be type int. See attached.
There are also two deadcode.DeadStores complaints from clang. First one is
about:
/*
* Align the wait time to prevent drift. This doesn't really matter,
* but we'd like the warnings about how long we've been waiting to say
* 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
* drifting to something that is not a multiple of ten.
*/
timeout_in_ms -=
TimestampDifferenceMilliseconds(current_time, initial_time) %
timeout_in_ms;
It looks like this timeout is really not used.
Oops. It should be. See attached.
And the minor one (similar to many existing, maybe doesn't deserve fixing):
walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read [deadcode.DeadStores]
summary_end_lsn = private_data->read_upto;
^ ~~~~~~~~~~~~~~~~~~~~~~~
It kind of surprises me that this is dead, but it seems best to keep
it there to be on the safe side, in case some change to the logic
renders it not dead in the future.
Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above redo_pointer_at_last_summary_removal declaration, but
perhaps it should say about removing summaries instead?
Wow, yeah. Thanks, will fix.
Thank you for paying attention to it!
I'll fix this next.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
fix-ib-thinkos.patch (application/octet-stream)
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 42bbe564e2..aa98e1872f 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -441,7 +441,7 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
/* Wait for up to 10 seconds. */
summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
- 10000, &pending_lsn);
+ timeout_in_ms, &pending_lsn);
/* If WAL summarization has progressed sufficiently, stop waiting. */
if (summarized_lsn >= backup_state->startpoint)
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 6decdd8934..21cba5b33d 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -504,7 +504,7 @@ make_rfile(char *filename, bool missing_ok)
static void
read_bytes(rfile *rf, void *buffer, unsigned length)
{
- unsigned rb = read(rf->fd, buffer, length);
+ int rb = read(rf->fd, buffer, length);
if (rb != length)
{
@@ -614,7 +614,7 @@ write_reconstructed_file(char *input_filename,
{
uint8 buffer[BLCKSZ];
rfile *s = sourcemap[i];
- unsigned wb;
+ int wb;
/* Update accounting information. */
if (s == NULL)
@@ -641,7 +641,7 @@ write_reconstructed_file(char *input_filename,
}
else
{
- unsigned rb;
+ int rb;
/* Read the block from the correct source, except if dry-run. */
rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
21.12.2023 23:43, Robert Haas wrote:
There are also two deadcode.DeadStores complaints from clang. First one is
about:
/*
* Align the wait time to prevent drift. This doesn't really matter,
* but we'd like the warnings about how long we've been waiting to say
* 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
* drifting to something that is not a multiple of ten.
*/
timeout_in_ms -=
TimestampDifferenceMilliseconds(current_time, initial_time) %
timeout_in_ms;
It looks like this timeout is really not used.
Oops. It should be. See attached.
My quick experiment shows that that TimestampDifferenceMilliseconds call
always returns zero, due to its arguments being swapped.
The other changes look good to me.
Thank you!
Best regards,
Alexander
My compiler has the following complaint:
../postgresql/src/backend/postmaster/walsummarizer.c: In function ‘GetOldestUnsummarizedLSN’:
../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: ‘unsummarized_lsn’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
540 | WalSummarizerCtl->pending_lsn = unsummarized_lsn;
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
I haven't looked closely to see whether there is actually a problem here,
but the attached patch at least resolves the warning.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
fix_uninitialized_warning.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9b5d3cdeb0..0cf6bbe59d 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -438,7 +438,7 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
LWLockMode mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
int n;
List *tles;
- XLogRecPtr unsummarized_lsn;
+ XLogRecPtr unsummarized_lsn = InvalidXLogRecPtr;
TimeLineID unsummarized_tli = 0;
bool should_make_exact = false;
List *existing_summaries;
On Sat, Dec 23, 2023 at 4:51 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
My compiler has the following complaint:
../postgresql/src/backend/postmaster/walsummarizer.c: In function ‘GetOldestUnsummarizedLSN’:
../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: ‘unsummarized_lsn’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
540 | WalSummarizerCtl->pending_lsn = unsummarized_lsn;
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
Thanks. I don't think there's a real bug, but I pushed a fix, same as
what you had.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote:
Thanks. I don't think there's a real bug, but I pushed a fix, same as
what you had.
Thanks! I also noticed that WALSummarizerLock probably needs a mention in
wait_event_names.txt.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 27, 2023 at 10:36 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote:
Thanks. I don't think there's a real bug, but I pushed a fix, same as
what you had.
Thanks! I also noticed that WALSummarizerLock probably needs a mention in
wait_event_names.txt.
Fixed.
It seems like it would be good if there were an automated cross-check
between lwlocknames.txt and wait_event_names.txt.
--
Robert Haas
EDB: http://www.enterprisedb.com
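Purely as a sketch of that idea (no such check exists yet; the file paths
and line formats below are assumptions about lwlocknames.txt and
wait_event_names.txt as of this writing):

grep -Eo '^[A-Za-z]+Lock' src/backend/storage/lmgr/lwlocknames.txt |
  sed 's/Lock$//' | sort >/tmp/locks
awk '/^Section:.*WaitEventLWLock/ {f=1; next} /^Section:/ {f=0}
     f && NF && $1 !~ /^#/ {print $1}' \
  src/backend/utils/activity/wait_event_names.txt | sort >/tmp/events
comm -23 /tmp/locks /tmp/events   # locks with no matching wait event name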
On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
My quick experiment shows that that TimestampDifferenceMilliseconds call
always returns zero, due to it's arguments swapped.
Thanks. Tom already changed the unsigned -> int stuff in a separate
commit, so I just pushed the fixes to PrepareForIncrementalBackup,
both the one I had before, and swapping the arguments to
TimestampDifferenceMilliseconds.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 3 Jan 2024 at 15:10, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin <exclusion@gmail.com>
wrote:
My quick experiment shows that that TimestampDifferenceMilliseconds call
always returns zero, due to its arguments being swapped.
Thanks. Tom already changed the unsigned -> int stuff in a separate
commit, so I just pushed the fixes to PrepareForIncrementalBackup,
both the one I had before, and swapping the arguments to
TimestampDifferenceMilliseconds.
I would like to query the following:
--tablespace-mapping=olddir=newdir
Relocates the tablespace in directory olddir to newdir during the
backup. olddir is the absolute path of the tablespace as it exists in the
first backup specified on the command line, and newdir is the absolute path
to use for the tablespace in the reconstructed backup.
The first backup specified on the command line will be the regular, full,
non-incremental backup. But if a tablespace was introduced subsequently,
it would only appear in an incremental backup. Wouldn't this then mean
that a mapping would need to be provided based on the path to the
tablespace of that incremental backup's copy?
Regards
Thom
On Thu, Apr 25, 2024 at 6:44 PM Thom Brown <thom@linux.com> wrote:
I would like to query the following:
--tablespace-mapping=olddir=newdir
Relocates the tablespace in directory olddir to newdir during the backup. olddir is the absolute path of the tablespace as it exists in the first backup specified on the command line, and newdir is the absolute path to use for the tablespace in the reconstructed backup.
The first backup specified on the command line will be the regular, full, non-incremental backup. But if a tablespace was introduced subsequently, it would only appear in an incremental backup. Wouldn't this then mean that a mapping would need to be provided based on the path to the tablespace of that incremental backup's copy?
Yes. Tomas Vondra found the same issue, which I have fixed in
1713e3d6cd393fcc1d4873e75c7fa1f6c7023d75.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
So I am a bit confused about the status of the tar format support, and
after re-reading the thread (or at least grepping it for ' tar '), this
wasn't really much discussed here either.
On Wed, Jun 14, 2023 at 02:46:48PM -0400, Robert Haas wrote:
- We only know how to operate on directories, not tar files. I thought
about that when working on pg_verifybackup as well, but I didn't do
anything about it. It would be nice to go back and make that tool work
on tar-format backups, and this one, too.
I believe "that tool" is pg_verifybackup, while "this one" is
pg_combinebackup? However, what's up with pg_basebackup itself with
respect to tar format incremental backups?
AFAICT (see below), pg_basebackup -Ft --incremental=foo/backup_manifest
happily creates an incremental backup in tar format; however,
pg_combinebackup will not be able to restore it? If that is the case,
shouldn't there be a bigger warning in the documentation about this, or
maybe pg_basebackup should refuse to make incremental tar-format backups
in the first place?
Am I missing something here? It will be obvious to users after the first
failure (to try to restore) that this will not work, and hopefully
everybody tests a restore before they put a backup solution into
production (or even better, waits until this whole feature is included in
a wholesale solution), but I wonder whether somebody might trip over
this after all and be unhappy. If one reads the pg_combinebackup
documentation carefully, it kinda becomes obvious that it does not
concern itself with tar format backups, but it is not spelt out
explicitly either.
|postgres@mbanck-lin-1:~$ pg_basebackup -c fast -Ft -D backup/backup_full
|postgres@mbanck-lin-1:~$ pg_basebackup -c fast -Ft -D backup/backup_incr_1 --incremental=backup/backup_full/backup_manifest
|postgres@mbanck-lin-1:~$ echo $?
|0
|postgres@mbanck-lin-1:~$ du -h backup/
|44M backup/backup_incr_1
|4,5G backup/backup_full
|4,5G backup/
|postgres@mbanck-lin-1:~$ tar tf backup/backup_incr_1/base.tar | grep INCR | head
|base/1/INCREMENTAL.3603
|base/1/INCREMENTAL.2187
|base/1/INCREMENTAL.13418
|base/1/INCREMENTAL.3467
|base/1/INCREMENTAL.2615_vm
|base/1/INCREMENTAL.2228
|base/1/INCREMENTAL.3503
|base/1/INCREMENTAL.2659
|base/1/INCREMENTAL.2607_vm
|base/1/INCREMENTAL.4164
|postgres@mbanck-lin-1:~$ /usr/lib/postgresql/17/bin/pg_combinebackup backup/backup_full/ backup/backup_incr_1/ -o backup/combined
|pg_combinebackup: error: could not open file "backup/backup_incr_1//PG_VERSION": No such file or directory
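One plausible, untested workaround, given that backup_manifest is a plain
file even in a tar-format output directory: extract each archive into a
directory, copy the manifest alongside, and combine the directories. The
paths below mirror the session above; pg_wal.tar would need similar
handling into the combined directory's pg_wal/.

$ mkdir backup/full_dir backup/incr_1_dir
$ tar -xf backup/backup_full/base.tar -C backup/full_dir
$ cp backup/backup_full/backup_manifest backup/full_dir
$ tar -xf backup/backup_incr_1/base.tar -C backup/incr_1_dir
$ cp backup/backup_incr_1/backup_manifest backup/incr_1_dir
$ /usr/lib/postgresql/17/bin/pg_combinebackup backup/full_dir backup/incr_1_dir -o backup/combined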
Michael