reorder pg_rewind control file sync

Started by Fabien COELHOalmost 7 years ago4 messages
#1Fabien COELHO
coelho@cri.ensmp.fr
1 attachment(s)

Bonjour Michaᅵl,

On Sat, 23 Mar 2019, Michael Paquier wrote:

On Fri, Mar 22, 2019 at 03:18:26PM +0100, Fabien COELHO wrote:

Attached is a quick patch about "pg_rewind", so that the control file
is updated after everything else is committed to disk.

Could you start a new thread about that please? This one has already
been used for too many things.

Here it is.

The attached patch reorders the cluster fsyncing and control file changes
in "pg_rewind" so that the later is done after all data are committed to
disk, so as to reflect the actual cluster status, similarly to what is
done by "pg_checksums", per discussion in the thread about offline
enabling of checksums:

/messages/by-id/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de

--
Fabien.

Attachments:

rewind-fsync-1.patchtext/x-diff; charset=us-ascii; name=rewind-fsync-1.patchDownload
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 3dcadb9b40..c1e6d7cd07 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -351,9 +351,12 @@ main(int argc, char **argv)
 
 	progress_report(true);
 
-	pg_log(PG_PROGRESS, "\ncreating backup label and updating control file\n");
+	pg_log(PG_PROGRESS, "\ncreating backup label\n");
 	createBackupLabel(chkptredo, chkpttli, chkptrec);
 
+	pg_log(PG_PROGRESS, "syncing target data directory\n");
+	syncTargetDirectory();
+
 	/*
 	 * Update control file of target. Make it ready to perform archive
 	 * recovery when restarting.
@@ -362,6 +365,7 @@ main(int argc, char **argv)
 	 * source server. Like in an online backup, it's important that we recover
 	 * all the WAL that was generated while we copied the files over.
 	 */
+	pg_log(PG_PROGRESS, "updating control file\n");
 	memcpy(&ControlFile_new, &ControlFile_source, sizeof(ControlFileData));
 
 	if (connstr_source)
@@ -377,11 +381,9 @@ main(int argc, char **argv)
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+
 	update_controlfile(datadir_target, progname, &ControlFile_new, do_sync);
 
-	pg_log(PG_PROGRESS, "syncing target data directory\n");
-	syncTargetDirectory();
-
 	printf(_("Done!\n"));
 
 	return 0;
#2Michael Paquier
michael@paquier.xyz
In reply to: Fabien COELHO (#1)
Re: reorder pg_rewind control file sync

On Sat, Mar 23, 2019 at 06:18:27AM +0100, Fabien COELHO wrote:

Here it is.

Thanks.

The attached patch reorders the cluster fsyncing and control file changes in
"pg_rewind" so that the later is done after all data are committed to disk,
so as to reflect the actual cluster status, similarly to what is done by
"pg_checksums", per discussion in the thread about offline enabling of
checksums:

It would be an interesting property to see that it is possible to
retry a rewind of a node which has been partially rewound already,
but the operation failed in the middle. Because that's the real deal
here: as long as we know that its control file is in its previous
state, we can rely on it for retrying the operation. Logically, I
think that it should work, because we would still try to fetch the
same blocks from the source server since WAL has forked by looking at
the records of the target up from the last checkpoint before WAL has
forked up to the last shutdown checkpoint, and the operation is lossy
by design when it comes to deal with file differences.

Have you tried to see if pg_rewind is able to repeat its operation for
specific scenarios? One is for example a database created on the
promoted standby, used as a source, and a second, different database
created on the primary after the standby has been promoted. You could
make the tool exit() before the rewind finishes, just before updating
the control file, and see if the operation is repeatable.
Interrupting the tool would be fine as well, still less controllable.

It would be good to mention in the patch why the order matters.
--
Michael

#3Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Paquier (#2)
Re: reorder pg_rewind control file sync

Bonjour Michaᅵl,

The attached patch reorders the cluster fsyncing and control file changes in
"pg_rewind" so that the later is done after all data are committed to disk,
so as to reflect the actual cluster status, similarly to what is done by
"pg_checksums", per discussion in the thread about offline enabling of
checksums:

It would be an interesting property to see that it is possible to
retry a rewind of a node which has been partially rewound already,
but the operation failed in the middle.

Yes. I understand that the question is whether the Warning in pg_rewind
documentation can be partially lifted. The short answer is that it is not
obvious.

Because that's the real deal here: as long as we know that its control
file is in its previous state, we can rely on it for retrying the
operation. Logically, I think that it should work, because we would
still try to fetch the same blocks from the source server since WAL has
forked by looking at the records of the target up from the last
checkpoint before WAL has forked up to the last shutdown checkpoint, and
the operation is lossy by design when it comes to deal with file
differences.

Have you tried to see if pg_rewind is able to repeat its operation for
specific scenarios?

I have run the non regression tests. I'm not sure of what scenarii are
covered there, but probably not an interruption in the middle of a fsync,
specially if fsync is usually disabled to ease the tests:-)

One is for example a database created on the promoted standby, used as a
source, and a second, different database created on the primary after
the standby has been promoted. You could make the tool exit() before
the rewind finishes, just before updating the control file, and see if
the operation is repeatable. Interrupting the tool would be fine as
well, still less controllable.

It would be good to mention in the patch why the order matters.

Yep. This requires a careful analysis of pg_rewind inner working, that I
do not have to do in the short terme.

--
Fabien.

#4Michael Paquier
michael@paquier.xyz
In reply to: Fabien COELHO (#3)
Re: reorder pg_rewind control file sync

On Mon, Mar 25, 2019 at 10:29:46AM +0100, Fabien COELHO wrote:

I have run the non regression tests. I'm not sure of what scenarii are
covered there, but probably not an interruption in the middle of a fsync,
specially if fsync is usually disabled to ease the tests:-)

Force the tool to stop at a specific point requires a booby-trap. And
even if fsync is not killed, you could just enforce the tool to stop
once before updating the control file, and attempt a re-run without
the trap, checking if it works at the second attempt, so the problem
is quite independent from the timing of fsync().
--
Michael