recovery_init_sync_method=wal
Hello hackers,
I'm starting a new thread and CF entry for the r15 material from the
earlier thread[1] that introduced the recovery_init_sync_method GUC
for r14. I wrote a summary of this topic as I see it, while it's
still fresh in my mind from working on commit 61752afb, starting from
what problem this solves.
TL;DR: Here's a patch that adds a less pessimistic, faster-starting
crash recovery mode based on first principles.
=== Background ===
Why do we synchronise the data directory before we run crash recovery?
1. WAL: It's not safe for changes to data pages to reach the disk
before the WAL. This is the basic log-before-data rule. Suppose we
didn't do that. If our online cluster crashed after calling
pwrite(<some WAL data>) but before calling fdatasync(), the WAL data
we later read in crash recovery may differ from what's really on disk,
and it'd be dangerous to replay its contents, because its effects on
data pages might then be written to disk at any time and break the
rule. If that happens and you lose power and then run crash recovery
a second time, now you have some phantom partial changes already
applied but no WAL left to redo them, leading to hazards including
xids being recycled, effects of committed transactions being partially
lost, multi-page changes being half done, and other such corruption.
2. Data files: We can't skip changes to a data page based on the page
header's LSN if the page is not known to be on disk (that is, it is
clean in PostgreSQL's buffer pool, but possibly dirty in the kernel's
page cache). Otherwise, the end-of-recovery checkpoint will do
nothing for the page (assuming nothing else marks the page dirty in
our buffer pool before that), so we'll complete the checkpoint, and
allow the WAL to be discarded. Then we might lose power before the
kernel gets a chance to write the data page to disk, and when the
lights come back on we'll run crash recovery but we don't replay that
forgotten change from before the bogus checkpoint, and we have lost
committed changes. (I don't think this can happen with
full_page_writes=on, because in that mode we never skip pages and
always do a full replay, which has various tradeoffs.)
I believe those are the fundamental reasons. If you know of more
reasons, I'd love to hear them.
Why don't we synchronise the data directory for a clean startup?
When we start up the database from a shutdown checkpoint, we take the
checkpoint at face value. A checkpoint is a promise that all changes
up to a given LSN have been durably stored on disk. There are a
couple of cases where that isn't true:
1. You were previously running with fsync=off. That's OK, we told
you not to do that. Checkpoints created without fsync barriers to
enforce the strict WAL-then-data-then-checkpoint protocol are
forgeries.
2. You made a file system-level copy of a cluster that you shut down
cleanly first, using cp, tar, scp, rsync, xmodem etc. Now you start
up the copy. Its checkpoint is a forgery. (Maybe our manual should
mention this problem under "25.2. File System Level Backup" where it
teaches you to rsync your cluster.)
How do the existing recovery_init_sync_method modes work?
You can think of recovery_init_sync_method as different "query plans"
for finding dirty buffers in the kernel's page cache to sync.
1. fsync: Go through the directory tree and call fsync() on each
file, just in case that file had some dirty pages. This is a terrible
plan if the files aren't currently in the kernel's VFS cache, because
it could take up to a few milliseconds to get each one in there
(random read to slower SSDs or network storage or IOPS-capped cloud
storage). If there really is a lot of dirty data, that's a good bet,
because the files must have been in the VFS cache already. But if
there are one million mostly read-only tables, it could take ages just
to *open* all the files, even though there's not much to actually
write out.
2. syncfs: Go through the kernel page cache instead, looking for
dirty data in the small number of file systems that contain our
directories. This is driven by data that is already in the kernel's
cache, so we avoid the need to perform I/O to search for dirty data.
That's great if your cluster is running mostly alone on the file
system in question, but it's not great if you're running another
PostgreSQL cluster on the same file system, because it will generate
extra write I/O when it finds incidental other stuff to write out.
These are both scatter gun approaches that can sometimes do a lot of
useless work, and I'd like to find a precise version that uses
information we already have about what might be dirty according to the
meaning of a checkpoint and a transaction log. The attached patch
does that as follows:
1. Sync the WAL using fsync(), to enforce the log-before-data rule.
That's moved into the existing loop that scans the WAL files looking
for temporary files to unlink. (I suppose it should perhaps do the
"presync" trick too. Not done yet.)
2. While replaying the WAL, if we ever decide to skip a page because
of its LSN, remember to fsync() the file in the next checkpoint anyway
(because the data might be dirty in the kernel). This way we sync
all files that changed since the last checkpoint (even if we don't
have to apply the change again). (A more paranoid mode would mark the
page dirty instead, so that we'll not only fsync() it, but we'll also
write it out again. This would defend against kernels that have
writeback failure modes that include keeping changes but dropping
their own dirty flag. Not done here.)
One thing about this approach is that it takes the checkpoint it
recovers from at face value. This is similar to the current situation
with startup from a clean shutdown checkpoint. If you're starting up
a database that was previously running with fsync=off, it won't fix
the problems that might have created, and if you beamed a copy of your
crashed cluster to another machine with rsync and took no steps to
sync it, then it won't fix the problems caused by random files that
are not yet flushed to disk, and that don't happen to be dirtied (or
skipped with BLK_DONE) by WAL replay.
recovery_init_sync_method=fsync or syncfs will fix at least that
second problem for you.
Now, what holes are there in this scheme?
[1]: /messages/by-id/11bc2bb7-ecb5-3ad0-b39f-df632734cd81@discourse.org
Attachments:
v8-0001-Provide-recovery_init_sync_method-wal.patch (text/x-patch)
From 6bb06cf444fd587ac30b59f725491102ccdc8ec6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 19 Mar 2021 13:06:35 +1300
Subject: [PATCH v8] Provide recovery_init_sync_method=wal.
It's important to avoid relying on dirty data in the kernel's cache when
deciding that a change has already been applied. The existing options
achieve that but potentially do a lot of unnecessary extra work.
This new option syncs only the WAL files, and then relies on analysis of
the WAL during replay. We start by assuming that modifications from
before the last checkpoint were already flushed by the checkpoint. That
is true if you crashed or ran pg_basebackup to the present location, but
false if you made a second copy with non-syncing tools. Then, if
full_page_writes=on, we expect all data modified since the last
checkpoint to be written to disk as part of the end-of-recovery
checkpoint. If full_page_writes=off, things are slightly trickier: we
skip updating pages that have an LSN higher than a WAL record
("BLK_DONE"), but we don't trust any data we read from the kernel's
cache to be really on disk, so we skip replay but we still make a note
to sync the file at the next checkpoint anyway.
(A slightly more paranoid mode would mark skipped pages dirty instead,
to trigger a rewrite of all pages we skip, if our threat model includes
operating systems that might mark their own page cache pages "clean"
rather than invalidating them after a writeback failure. That is not
done in this commit.)
Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
doc/src/sgml/config.sgml | 12 +++++
src/backend/access/transam/xlog.c | 52 +++++++++++++------
src/backend/access/transam/xlogutils.c | 11 ++++
src/backend/storage/buffer/bufmgr.c | 20 +++++++
src/backend/storage/file/fd.c | 11 ++--
src/backend/storage/smgr/md.c | 15 ++++++
src/backend/storage/smgr/smgr.c | 15 ++++++
src/backend/utils/misc/guc.c | 1 +
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/include/storage/bufmgr.h | 1 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 1 +
src/include/storage/smgr.h | 2 +
13 files changed, 124 insertions(+), 20 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee4925d6d9..fc453b3915 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9753,6 +9753,18 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
<productname>PostgreSQL</productname>, and relevant error messages may
appear only in kernel logs.
</para>
+ <para>
+ When set to <literal>wal</literal>,
+ <productname>PostgreSQL</productname> will synchronize all WAL files
+ before recovery begins, and then rely on WAL replay to synchronize all
+ data files modified since the last checkpoint. Data files that are not
+ modified by WAL replay are not synchronized, which is sufficient if
+ recovering from a crash or a copy made with
+ <application>pg_basebackup</application> (without
+ <literal>--no-sync</literal>), but may be insufficient if
+ the data directory was copied from its original location without taking
+ extra measures to ensure that it has reached the disk.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e149..89678e6470 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -927,7 +927,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
-static void RemoveTempXlogFiles(void);
+static void ScanXlogFilesAfterCrash(void);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
XLogSegNo *endlogSegNo);
@@ -4030,13 +4030,15 @@ UpdateLastRemovedPtr(char *filename)
}
/*
- * Remove all temporary log files in pg_wal
+ * Remove all temporary log files in pg_wal, and make sure that all remaining
+ * files are down on disk before we replay anything in them if
+ * recovery_init_sync_method requires that.
*
* This is called at the beginning of recovery after a previous crash,
* at a point where no other processes write fresh WAL data.
*/
static void
-RemoveTempXlogFiles(void)
+ScanXlogFilesAfterCrash(void)
{
DIR *xldir;
struct dirent *xlde;
@@ -4048,12 +4050,23 @@ RemoveTempXlogFiles(void)
{
char path[MAXPGPATH];
- if (strncmp(xlde->d_name, "xlogtemp.", 9) != 0)
- continue;
-
snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
- unlink(path);
- elog(DEBUG2, "removed temporary WAL segment \"%s\"", path);
+
+ if (strncmp(xlde->d_name, "xlogtemp.", 9) == 0)
+ {
+ unlink(path);
+ elog(DEBUG2, "removed temporary WAL segment \"%s\"", path);
+ }
+ else if (IsXLogFileName(xlde->d_name) &&
+ recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
+ {
+ /*
+ * Make sure that whatever we read from this WAL file is durably on
+ * disk before replaying in RECOVERY_INIT_SYNC_METHOD_WAL mode.
+ * Necessary because SyncDataDirectory() won't do that for us.
+ */
+ fsync_fname(path, false);
+ }
}
FreeDir(xldir);
}
@@ -6547,23 +6560,30 @@ StartupXLOG(void)
ValidateXLOGDirectoryStructure();
/*----------
- * If we previously crashed, perform a couple of actions:
+ * If we previously crashed:
*
* - The pg_wal directory may still include some temporary WAL segments
* used when creating a new segment, so perform some clean up to not
- * bloat this path. This is done first as there is no point to sync
- * this temporary data.
+ * bloat this path, in ScanXlogFilesAfterCrash().
+ *
+ * - It's possible that some WAL data had been written but not yet synced.
+ * Therefore we have to sync these files before replaying the records
+ * they contain. If recovery_init_sync_method=wal we'll do that
+ * in ScanXlogFilesAfterCrash(); otherwise we'll leave it up to
+ * SyncDataDirectory().
*
* - There might be data which we had written, intending to fsync it, but
- * which we had not actually fsync'd yet. Therefore, a power failure in
- * the near future might cause earlier unflushed writes to be lost, even
- * though more recent data written to disk from here on would be
- * persisted. To avoid that, fsync the entire data directory.
+ * which we had not actually fsync'd yet. If
+ * recovery_init_sync_method=wal, then XLogReadBufferForRedo() will
+ * compute the set of files that may need fsyncing at the next
+ * checkpoint, during recovery. Otherwise, SyncDataDirectory() will
+ * sync the entire pgdata directory up front, to avoid relying on data
+ * from the kernel's cache that hasn't reached disk yet.
*/
if (ControlFile->state != DB_SHUTDOWNED &&
ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
{
- RemoveTempXlogFiles();
+ ScanXlogFilesAfterCrash();
SyncDataDirectory();
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f46..85303e761e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -401,7 +401,18 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
}
if (lsn <= PageGetLSN(BufferGetPage(*buf)))
+ {
+ /*
+ * The page is from the future. We must be in crash recovery.
+ * We don't need to redo the record, but we do need to make
+ * sure that the image as we have seen it is durably stored on
+ * disk as part of the next checkpoint, unless that was already
+ * done by SyncDataDirectory().
+ */
+ if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
+ RequestBufferSync(*buf);
return BLK_DONE;
+ }
else
return BLK_NEEDS_REDO;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c9..6393545f0c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1530,6 +1530,26 @@ MarkBufferDirty(Buffer buffer)
}
}
+/*
+ * Request that the file containing a buffer be fsynced at the next checkpoint.
+ * This is only used in crash recovery, to make it safe to skip applying WAL on
+ * the basis that the page already has a change. We want to make sure that the
+ * data we read from the kernel matches what's on disk at the next checkpoint.
+ */
+void
+RequestBufferSync(Buffer buffer)
+{
+ SMgrRelation reln;
+ BufferDesc *bufHdr;
+
+ Assert(BufferIsPinned(buffer));
+ Assert(!BufferIsLocal(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ reln = smgropen(bufHdr->tag.rnode, InvalidBackendId);
+ smgrsync(reln, bufHdr->tag.forkNum, bufHdr->tag.blockNum);
+}
+
/*
* ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer()
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 28933f8bbe..dde20fb5c3 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -336,6 +336,7 @@ static void walkdir(const char *path,
void (*action) (const char *fname, bool isdir, int elevel),
bool process_symlinks,
int elevel);
+
#ifdef PG_FLUSH_DATA_WORKS
static void pre_sync_fname(const char *fname, bool isdir, int elevel);
#endif
@@ -3293,8 +3294,9 @@ do_syncfs(const char *path)
#endif
/*
- * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
- * all potential filesystem, depending on recovery_init_sync_method setting.
+ * Issue fsync recursively on PGDATA and all its contents, issue syncfs for all
+ * potential filesystems, or do nothing, depending on the
+ * recovery_init_sync_method setting.
*
* We fsync regular files and directories wherever they are, but we
* follow symlinks only for pg_wal and immediately under pg_tblspc.
@@ -3323,6 +3325,10 @@ SyncDataDirectory(void)
if (!enableFsync)
return;
+ /* Likewise if we're using WAL analysis instead. */
+ if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
+ return;
+
/*
* If pg_wal is a symlink, we'll need to recurse into it separately,
* because the first walkdir below will ignore it.
@@ -3478,7 +3484,6 @@ walkdir(const char *path,
(*action) (path, true, elevel);
}
-
/*
* Hint to the OS that it should get ready to fsync() this file.
*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1e12cfad8e..d74506be31 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -961,6 +961,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
+/*
+ * mdsync() -- Schedule a sync at the next checkpoint.
+ *
+ * This is useful in crash recovery, to ensure that data we've read from the
+ * kernel matches what is on disk before a checkpoint.
+ */
+void
+mdsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+ MdfdVec *seg;
+
+ seg = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+ register_dirty_segment(reln, forknum, seg);
+}
+
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..6856c40300 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,8 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_sync) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -79,6 +81,7 @@ static const f_smgr smgrsw[] = {
.smgr_read = mdread,
.smgr_write = mdwrite,
.smgr_writeback = mdwriteback,
+ .smgr_sync = mdsync,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
@@ -540,6 +543,18 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nblocks);
}
+
+/*
+ * smgrsync() -- Request that the file containing a block be flushed at the
+ * next checkpoint.
+ */
+void
+smgrsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+ smgrsw[reln->smgr_which].smgr_sync(reln, forknum, blocknum);
+}
+
+
/*
* smgrnblocks() -- Calculate the number of blocks in the
* supplied relation.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2964efda96..2b5e1151c7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -493,6 +493,7 @@ static struct config_enum_entry recovery_init_sync_method_options[] = {
#ifdef HAVE_SYNCFS
{"syncfs", RECOVERY_INIT_SYNC_METHOD_SYNCFS, false},
#endif
+ {"wal", RECOVERY_INIT_SYNC_METHOD_WAL, false},
{NULL, 0, false}
};
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 86425965d0..b9091ccb75 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -761,7 +761,7 @@
#restart_after_crash = on # reinitialize after backend crash?
#remove_temp_files_after_crash = on # remove temporary files after
# backend crash?
-#recovery_init_sync_method = fsync # fsync, syncfs (Linux 5.8+)
+#recovery_init_sync_method = fsync # fsync, syncfs (Linux 5.8+), wal
#data_sync_retry = off # retry or panic on failure to fsync
# data?
# (change requires restart)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fb00fda6a7..f5409264da 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -186,6 +186,7 @@ extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
+extern void RequestBufferSync(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 328473bdc9..2fe0ce890e 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -46,6 +46,7 @@
#include <dirent.h>
typedef enum RecoveryInitSyncMethod {
+ RECOVERY_INIT_SYNC_METHOD_WAL,
RECOVERY_INIT_SYNC_METHOD_FSYNC,
RECOVERY_INIT_SYNC_METHOD_SYNCFS
} RecoveryInitSyncMethod;
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..35e813410a 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -40,6 +40,7 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void mdsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..4a9cc9f181 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -103,6 +103,8 @@ extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrsync(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void AtEOXact_SMgr(void);
#endif /* SMGR_H */
--
2.30.1
Greetings,
* Thomas Munro (thomas.munro@gmail.com) wrote:
2. You made a file system-level copy of a cluster that you shut down
cleanly first, using cp, tar, scp, rsync, xmodem etc. Now you start
up the copy. Its checkpoint is a forgery. (Maybe our manual should
mention this problem under "25.2. File System Level Backup" where it
teaches you to rsync your cluster.)
Yes, it'd be good to get some updates to the backup documentation around
this which stresses in all cases that your backup utility should make
sure to fsync everything it restores.
These are both scatter gun approaches that can sometimes do a lot of
useless work, and I'd like to find a precise version that uses
information we already have about what might be dirty according to the
meaning of a checkpoint and a transaction log. The attached patch
does that as follows:

1. Sync the WAL using fsync(), to enforce the log-before-data rule.
That's moved into the existing loop that scans the WAL files looking
for temporary files to unlink. (I suppose it should perhaps do the
"presync" trick too. Not done yet.)

2. While replaying the WAL, if we ever decide to skip a page because
of its LSN, remember to fsync() the file in the next checkpoint anyway
(because the data might be dirty in the kernel). This way we sync
all files that changed since the last checkpoint (even if we don't
have to apply the change again). (A more paranoid mode would mark the
page dirty instead, so that we'll not only fsync() it, but we'll also
write it out again. This would defend against kernels that have
writeback failure modes that include keeping changes but dropping
their own dirty flag. Not done here.)
Presuming that we do add to the documentation the language to document
what's assumed (and already done by modern backup tools) that they're
fsync'ing everything they're restoring, do we/can we have an option
which those tools could set that explicitly tells PG "everything in the
cluster has been fsync'd already, you don't need to do anything extra"?
Perhaps also/separately one for WAL that's restored with restore command
if we think that's necessary?
Otherwise, just in general, agree with doing this to address the risks
discussed around regular crash recovery. We have some pretty clear "if
the DB was doing recovery and was interrupted, you need to restore from
backup" messages today in xlog.c, and this patch didn't seem to change
that? Am I missing something or isn't the idea here that these changes
would make it so you aren't going to end up with corruption in those
cases? Specifically looking at-
xlog.c:6509-
case DB_IN_CRASH_RECOVERY:
ereport(LOG,
(errmsg("database system was interrupted while in recovery at %s",
str_time(ControlFile->time)),
errhint("This probably means that some data is corrupted and"
" you will have to use the last backup for recovery.")));
break;
case DB_IN_ARCHIVE_RECOVERY:
ereport(LOG,
(errmsg("database system was interrupted while in recovery at log time %s",
str_time(ControlFile->checkPointCopy.time)),
errhint("If this has occurred more than once some data might be corrupted"
" and you might need to choose an earlier recovery target.")));
break;
Thanks!
Stephen
On Mon, Mar 22, 2021 at 4:31 AM Stephen Frost <sfrost@snowman.net> wrote:
Presuming that we do add to the documentation the language to document
what's assumed (and already done by modern backup tools) that they're
fsync'ing everything they're restoring, do we/can we have an option
which those tools could set that explicitly tells PG "everything in the
cluster has been fsync'd already, you don't need to do anything extra"?
Perhaps also/separately one for WAL that's restored with restore command
if we think that's necessary?
In the earlier thread, we did contemplate
recovery_init_sync_method=none, but it has the problem that after
recovery completes you have a cluster running with a setting that is
really bad if you eventually crash again and run crash recovery again,
this time with a dirty kernel page cache. I was one of the people
voting against that feature, but I also wrote a strawman patch just
for the visceral experience of every cell in my body suddenly
whispering "yeah, nah, I'm not committing that" as I wrote the
weaselwordage for the manual.
In that thread we also contemplated safe ways for a basebackup-type
tool to promise that data has been sync'd, to avoid that problem with
the GUC. Maybe something like: the backup label file could contain a
"SYNC_DONE" message. But then what if someone copies the whole
directory to a new location, how can you invalidate the promise? This
is another version of the question of whether it's our problem or the
user's to worry about buffering of pgdata files that they copy around
with unknown tools. If it's our problem, maybe something like:
"SYNC_DONE for pgdata_inode=1234, hostname=x, ..." is enough to detect
cases where we still believe the claim. But there's probably a long
tail of weird ways for whatever you come up with to be deceived.
In the case of the recovery_init_sync_method=wal patch proposed in
*this* thread, here's the thing: there's not much to gain by trying to
skip the sync, anyway! For the WAL, you'll be opening those files
soon anyway to replay them, and if they're already sync'd then fsync()
will return quickly. For the relfile data, you'll be opening all
relfiles that are referenced by the WAL soon anyway, and syncing them
if required. So that just leaves relfiles that are not referenced in
the WAL you're about to replay. Whether we have a duty to sync those
too is that central question again, and one of the things I'm trying
to get an answer to with this thread. The
recovery_init_sync_method=wal patch only makes sense if the answer is
"no".
Otherwise, just in general, agree with doing this to address the risks
discussed around regular crash recovery. We have some pretty clear "if
the DB was doing recovery and was interrupted, you need to restore from
backup" messages today in xlog.c, and this patch didn't seem to change
that? Am I missing something or isn't the idea here that these changes
would make it so you aren't going to end up with corruption in those
cases? Specifically looking at-

xlog.c:6509-
case DB_IN_CRASH_RECOVERY:
ereport(LOG,
(errmsg("database system was interrupted while in recovery at %s",
str_time(ControlFile->time)),
errhint("This probably means that some data is corrupted and"
" you will have to use the last backup for recovery.")));
break;

case DB_IN_ARCHIVE_RECOVERY:
ereport(LOG,
(errmsg("database system was interrupted while in recovery at log time %s",
str_time(ControlFile->checkPointCopy.time)),
errhint("If this has occurred more than once some data might be corrupted"
" and you might need to choose an earlier recovery target.")));
break;
Maybe I missed your point... but I don't think anything changes here?
If recovery is crashing in some deterministic way (not just because
you lost power the first time, but rather because a particular log
record hits the same bug or gets confused by the same corruption and
implodes every time) it'll probably keep doing so, and our sync
algorithm doesn't seem to make a difference to that.
Greetings,
* Thomas Munro (thomas.munro@gmail.com) wrote:
On Mon, Mar 22, 2021 at 4:31 AM Stephen Frost <sfrost@snowman.net> wrote:
Presuming that we do add to the documentation the language to document
what's assumed (and already done by modern backup tools) that they're
fsync'ing everything they're restoring, do we/can we have an option
which those tools could set that explicitly tells PG "everything in the
cluster has been fsync'd already, you don't need to do anything extra"?
Perhaps also/separately one for WAL that's restored with restore command
if we think that's necessary?

In the earlier thread, we did contemplate
recovery_init_sync_method=none, but it has the problem that after
recovery completes you have a cluster running with a setting that is
really bad if you eventually crash again and run crash recovery again,
this time with a dirty kernel page cache. I was one of the people
voting against that feature, but I also wrote a strawman patch just
for the visceral experience of every cell in my body suddenly
whispering "yeah, nah, I'm not committing that" as I wrote the
weaselwordage for the manual.
Why not have a 'recovery_init_sync_method=backup'? It's not like
there's a question about if we're doing recovery from a backup or not.
In that thread we also contemplated safe ways for a basebackup-type
tool to promise that data has been sync'd, to avoid that problem with
the GUC. Maybe something like: the backup label file could contain a
"SYNC_DONE" message. But then what if someone copies the whole
directory to a new location, how can you invalidate the promise? This
is another version of the question of whether it's our problem or the
user's to worry about buffering of pgdata files that they copy around
with unknown tools. If it's our problem, maybe something like:
"SYNC_DONE for pgdata_inode=1234, hostname=x, ..." is enough to detect
cases where we still believe the claim. But there's probably a long
tail of weird ways for whatever you come up with to be deceived.
I don't really care for the idea of backup tools modifying the backup
label... they're explicitly told not to do that and that seems like the
best move. I also don't particularly care about silly "what ifs" like
if someone randomly copies the data directory after the restore- yes,
there's a lot of ways that people can screw things up by doing things
that aren't sane, but that doesn't mean we should try to cater to such
cases.
In the case of the recovery_init_sync_method=wal patch proposed in
*this* thread, here's the thing: there's not much to gain by trying to
skip the sync, anyway! For the WAL, you'll be opening those files
soon anyway to replay them, and if they're already sync'd then fsync()
will return quickly. For the relfile data, you'll be opening all
relfiles that are referenced by the WAL soon anyway, and syncing them
if required. So that just leaves relfiles that are not referenced in
the WAL you're about to replay. Whether we have a duty to sync those
too is that central question again, and one of the things I'm trying
to get an answer to with this thread. The
recovery_init_sync_method=wal patch only makes sense if the answer is
"no".
I'm not too bothered by an extra fsync() for WAL files, just to be
clear, it's running around fsync'ing everything else that seems
objectionable to me.
Otherwise, just in general, agree with doing this to address the risks
discussed around regular crash recovery. We have some pretty clear "if
the DB was doing recovery and was interrupted, you need to restore from
backup" messages today in xlog.c, and this patch didn't seem to change
that? Am I missing something or isn't the idea here that these changes
would make it so you aren't going to end up with corruption in those
cases? Specifically looking at-

xlog.c:6509-
case DB_IN_CRASH_RECOVERY:
ereport(LOG,
(errmsg("database system was interrupted while in recovery at %s",
str_time(ControlFile->time)),
errhint("This probably means that some data is corrupted and"
" you will have to use the last backup for recovery.")));
break;

case DB_IN_ARCHIVE_RECOVERY:
ereport(LOG,
(errmsg("database system was interrupted while in recovery at log time %s",
str_time(ControlFile->checkPointCopy.time)),
errhint("If this has occurred more than once some data might be corrupted"
" and you might need to choose an earlier recovery target.")));
break;

Maybe I missed your point... but I don't think anything changes here?
If recovery is crashing in some deterministic way (not just because
you lost power the first time, but rather because a particular log
record hits the same bug or gets confused by the same corruption and
implodes every time) it'll probably keep doing so, and our sync
algorithm doesn't seem to make a difference to that.
These errors aren't thrown only in the case where we hit a bad XLOG
record though, are they..? Maybe I missed that somehow but it seems
like these get thrown in the simple case that we, say, lost power,
started recovery and didn't finish recovery and lost power again, even
though with your patches hopefully that wouldn't actually result in a
failure case or in corruption..?
In the 'bad XLOG' or 'confused by corruption' cases, I wonder if it'd be
helpful to write that out more explicitly somehow.. essentially
segregating these cases.
Thanks,
Stephen