Use durable_unlink for .ready and .done files for WAL segment removal

Started by Michael Paquierover 7 years ago29 messages
#1Michael Paquier
michael@paquier.xyz
1 attachment(s)

Hi all,

While reviewing the archiving code, I have bumped into the fact that
XLogArchiveCleanup() thinks that it is safe to do only a plain unlink()
for .ready and .done files when removing a past segment. I don't think
that it is a smart move, as on a subsequent crash we may still see
those, but the related segment would have gone away. This is not really
a problem for .done files, but it could confuse the archiver to see some
.ready files about things that have already gone away.

Attached is a patch. Thoughts?
--
Michael

Attachments:

archive-clean-durable.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d40317168e..86c8383e11 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -762,11 +762,11 @@ XLogArchiveCleanup(const char *xlog)
 
 	/* Remove the .done file */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
-	unlink(archiveStatusPath);
+	durable_unlink(archiveStatusPath, DEBUG1);
 	/* should we complain about failure? */
 
 	/* Remove the .ready file if present --- normally it shouldn't be */
 	StatusFilePath(archiveStatusPath, xlog, ".ready");
-	unlink(archiveStatusPath);
+	durable_unlink(archiveStatusPath, DEBUG1);
 	/* should we complain about failure? */
 }
#2Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#1)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

Hi,

On 2018-09-28 12:28:27 +0900, Michael Paquier wrote:

While reviewing the archiving code, I have bumped into the fact that
XLogArchiveCleanup() thinks that it is safe to do only a plain unlink()
for .ready and .done files when removing a past segment. I don't think
that it is a smart move, as on a subsequent crash we may still see
those, but the related segment would have gone away. This is not really
a problem for .done files, but it could confuse the archiver to see some
.ready files about things that have already gone away.

Isn't that window fundamentally there anyway?

- Andres

#3Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#2)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Thu, Sep 27, 2018 at 08:40:26PM -0700, Andres Freund wrote:

On 2018-09-28 12:28:27 +0900, Michael Paquier wrote:

While reviewing the archiving code, I have bumped into the fact that
XLogArchiveCleanup() thinks that it is safe to do only a plain unlink()
for .ready and .done files when removing a past segment. I don't think
that it is a smart move, as on a subsequent crash we may still see
those, but the related segment would have gone away. This is not really
a problem for .done files, but it could confuse the archiver to see some
.ready files about things that have already gone away.

Isn't that window fundamentally there anyway?

Sure. However the point I would like to make is that if we have the
possibility to reduce this window, then we should.
--
Michael

#4Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#3)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On September 27, 2018 10:23:31 PM PDT, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Sep 27, 2018 at 08:40:26PM -0700, Andres Freund wrote:

On 2018-09-28 12:28:27 +0900, Michael Paquier wrote:

While reviewing the archiving code, I have bumped into the fact that
XLogArchiveCleanup() thinks that it is safe to do only a plain

unlink()

for .ready and .done files when removing a past segment. I don't

think

that it is a smart move, as on a subsequent crash we may still see
those, but the related segment would have gone away. This is not

really

a problem for .done files, but it could confuse the archiver to see

some

.ready files about things that have already gone away.

Isn't that window fundamentally there anyway?

Sure. However the point I would like to make is that if we have the
possibility to reduce this window, then we should.

It's not free though. I don't think this is as clear cut as you make it sound.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#5Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#1)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:

While reviewing the archiving code, I have bumped into the fact that
XLogArchiveCleanup() thinks that it is safe to do only a plain unlink()
for .ready and .done files when removing a past segment. I don't think
that it is a smart move, as on a subsequent crash we may still see
those, but the related segment would have gone away. This is not really
a problem for .done files, but it could confuse the archiver to see some
.ready files about things that have already gone away.

Is there an issue with making the archiver able to understand that
situation instead of being confused by it..? Seems like that'd probably
be a good thing to do regardless of this, but that would then remove the
need for this kind of change..

Thanks!

Stephen

#6Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#5)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Fri, Sep 28, 2018 at 02:36:19PM -0400, Stephen Frost wrote:

Is there an issue with making the archiver able to understand that
situation instead of being confused by it..? Seems like that'd probably
be a good thing to do regardless of this, but that would then remove the
need for this kind of change..

I thought about that a bit, and there is as well a lot which can be done
within the archive_command itself regarding that, so I am not sure that
there is the argument to make pgarch.c more complicated than it should.
Now it is true that for most users having a .ready file but no segment
would most likely lead in a failure. I suspect that a large user base
is still just using plain cp in archive_command, which would cause the
archiver to be stuck. So we could actually just tweak pgarch_readyXlog
to check if the segment fetched actually exists (see bottom of the
so-said function). If it doesn't, then the archiver removes the .ready
file and retries fetching a new segment.
--
Michael

#7Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#6)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:

On Fri, Sep 28, 2018 at 02:36:19PM -0400, Stephen Frost wrote:

Is there an issue with making the archiver able to understand that
situation instead of being confused by it..? Seems like that'd probably
be a good thing to do regardless of this, but that would then remove the
need for this kind of change..

I thought about that a bit, and there is as well a lot which can be done
within the archive_command itself regarding that, so I am not sure that
there is the argument to make pgarch.c more complicated than it should.
Now it is true that for most users having a .ready file but no segment
would most likely lead in a failure. I suspect that a large user base
is still just using plain cp in archive_command, which would cause the
archiver to be stuck. So we could actually just tweak pgarch_readyXlog
to check if the segment fetched actually exists (see bottom of the
so-said function). If it doesn't, then the archiver removes the .ready
file and retries fetching a new segment.

Yes, checking if the WAL file exists before calling archive_command on
it is what I was thinking we'd do here, and if it doesn't, then just
remove the .ready file.

An alternative would be to go through the .ready files on crash-recovery
and remove any .ready files that don't have corresponding WAL files, or
if we felt that it was necessary, we could do that on every restart but
do we really think we'd need to do that..?

Thanks!

Stephen

#8Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#7)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Fri, Sep 28, 2018 at 07:16:25PM -0400, Stephen Frost wrote:

An alternative would be to go through the .ready files on crash-recovery
and remove any .ready files that don't have corresponding WAL files, or
if we felt that it was necessary, we could do that on every restart but
do we really think we'd need to do that..?

Actually, what you are proposing here sounds much better to me. That's
in the area of what has been done recently with RemoveTempXlogFiles() in
5fc1008e. Any objections to doing something like that?
--
Michael

#9Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#8)
1 attachment(s)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Sat, Sep 29, 2018 at 04:58:57PM +0900, Michael Paquier wrote:

Actually, what you are proposing here sounds much better to me. That's
in the area of what has been done recently with RemoveTempXlogFiles() in
5fc1008e. Any objections to doing something like that?

Okay. I have hacked a patch based on Stephen's idea as attached. Any
opinions?
--
Michael

Attachments:

archive-missing-v1.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7375a78ffc..fe41400a7a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6456,6 +6456,11 @@ StartupXLOG(void)
 	 * used when creating a new segment, so perform some clean up to not
 	 * bloat this path.  This is done first as there is no point to sync this
 	 * temporary data.
+	 *	- The pg_wal/archive_status directory may still include .ready or
+	 * .done files which refer to already-removed WAL segments, as recycled
+	 * or removed segments are removed before the corresponding archive
+	 * status files are themselves removed.  This is also done before
+	 * syncing the data directory.
 	 *	- There might be data which we had written, intending to fsync it,
 	 * but which we had not actually fsync'd yet. Therefore, a power failure
 	 * in the near future might cause earlier unflushed writes to be lost,
@@ -6467,6 +6472,7 @@ StartupXLOG(void)
 		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
 	{
 		RemoveTempXlogFiles();
+		XLogArchiveCleanStatus();
 		SyncDataDirectory();
 	}
 
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d40317168e..67818638d5 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -652,6 +652,56 @@ XLogArchiveCheckDone(const char *xlog)
 	return false;
 }
 
+/*
+ * XLogArchiveCleanStatus
+ *
+ * Remove .ready and .done files in pg_wal/archive_status which refer to
+ * non-existing WAL segments.  When a segment is removed or recycled at
+ * checkpoint or restart point, its corresponding archive status files are
+ * removed after, so if an instance crashes at this point some files may
+ * remain behind, confusing the archiver.
+ *
+ * This is called at the beginning of recovery after a previous crash where
+ * no other processes write WAL data.
+ */
+void
+XLogArchiveCleanStatus(void)
+{
+	DIR		   *xldir;
+	struct dirent *xlde;
+
+	elog(DEBUG2, "removing archive status referring to missing WAL segments");
+
+	xldir = AllocateDir(XLOGDIR "/archive_status");
+
+	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+	{
+		char		path[MAXPGPATH];
+		char		xlogname[XLOG_FNAME_LEN + 1];
+		struct stat stat_buf;
+
+		/*
+		 * Check for compatible .ready and .done files, then extract the
+		 * related WAL segment.
+		 */
+		if (!StatusFileIsDone(xlde->d_name) &&
+			!StatusFileIsReady(xlde->d_name))
+			continue;
+
+		memcpy(xlogname, xlde->d_name, XLOG_FNAME_LEN);
+		xlogname[XLOG_FNAME_LEN] = '\0';
+		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlogname);
+
+		if (stat(path, &stat_buf) == 0)
+			continue;
+
+		snprintf(path, MAXPGPATH, XLOGDIR "/archive_status/%s", xlde->d_name);
+		unlink(path);
+		elog(DEBUG2, "removed archive status file \"%s\"", path);
+	}
+	FreeDir(xldir);
+}
+
 /*
  * XLogArchiveIsBusy
  *
@@ -760,6 +810,13 @@ XLogArchiveCleanup(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 
+	/*
+	 * durable_unlink is not used here.  This is used after a segment has been
+	 * removed or renamed, and even if the system crashes in-between
+	 * notification files referring to missing WAL segments are automatically
+	 * removed at the beginning of recovery.
+	 */
+
 	/* Remove the .done file */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	unlink(archiveStatusPath);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 30610b3ea9..4c8d7dae62 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -201,6 +201,16 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define StatusFilePath(path, xlog, suffix)	\
 	snprintf(path, MAXPGPATH, XLOGDIR "/archive_status/%s%s", xlog, suffix)
 
+#define StatusFileIsReady(fname) \
+	(strlen(fname) == XLOG_FNAME_LEN + strlen(".ready") &&		\
+	 strspn(fname, "0123456789ABCDEF") == XLOG_FNAME_LEN &&		\
+	 strcmp((fname) + XLOG_FNAME_LEN, ".ready") == 0)
+
+#define StatusFileIsDone(fname) \
+	(strlen(fname) == XLOG_FNAME_LEN + strlen(".done") &&		\
+	 strspn(fname, "0123456789ABCDEF") == XLOG_FNAME_LEN &&		\
+	 strcmp((fname) + XLOG_FNAME_LEN, ".done") == 0)
+
 #define BackupHistoryFileName(fname, tli, logSegNo, startpoint, wal_segsz_bytes) \
 	snprintf(fname, MAXFNAMELEN, "%08X%08X%08X.%08X.backup", tli, \
 			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId(wal_segsz_bytes)), \
@@ -328,6 +338,7 @@ extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
+extern void XLogArchiveCleanStatus(void);
 extern bool XLogArchiveIsBusy(const char *xlog);
 extern bool XLogArchiveIsReady(const char *xlog);
 extern bool XLogArchiveIsReadyOrDone(const char *xlog);
#10Nathan Bossart
bossartn@amazon.com
In reply to: Michael Paquier (#9)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

One argument for instead checking WAL file existence before calling
archive_command might be to avoid the increased startup time.
Granted, any added delay from this patch is unlikely to be noticeable
unless your archiver is way behind and archive_status has a huge
number of files. However, I have seen cases where startup is stuck on
other tasks like SyncDataDirectory() and RemovePgTempFiles() for a
very long time, so perhaps it is worth considering.

Nathan

#11Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Nathan Bossart (#10)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

At Fri, 02 Nov 2018 14:47:08 +0000, Nathan Bossart <bossartn@amazon.com> wrote in <154117002849.5569.14588306221618961668.pgcf@coridan.postgresql.org>

One argument for instead checking WAL file existence before calling
archive_command might be to avoid the increased startup time.
Granted, any added delay from this patch is unlikely to be noticeable
unless your archiver is way behind and archive_status has a huge
number of files. However, I have seen cases where startup is stuck on
other tasks like SyncDataDirectory() and RemovePgTempFiles() for a
very long time, so perhaps it is worth considering.

While archive_mode is tuned on, .ready files are created for all
exising wal files if not exists. Thus archiver may wait for the
ealiest segment to have .ready file. As the result
pgarch_readyXLog can be modified to loops over wal files, not
status files. This prevents the confusion comes from .ready
files for non-existent segment files.

RemoveXlogFile as is doesn't get confused by .done files for
nonexistent segments.

We may leave useless .done/.ready files. We no longer scan over
the files so no matter how many files are there in the directory.

The remaining issue is removal of the files. Even if we blew away
the directory altogether, status files would be cleanly recreated
having already-archived wal segments are archived again. However,
redundant copy won't happen with our recommending configuration:p

# Yeah, I see almost all sites uses simple 'cp' or 'scp' for that..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro HORIGUCHI (#11)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Thu, Nov 15, 2018 at 07:39:27PM +0900, Kyotaro HORIGUCHI wrote:

At Fri, 02 Nov 2018 14:47:08 +0000, Nathan Bossart
<bossartn@amazon.com> wrote in
<154117002849.5569.14588306221618961668.pgcf@coridan.postgresql.org>:

One argument for instead checking WAL file existence before calling
archive_command might be to avoid the increased startup time.

I guess that you mean the startup of the archive command itself here.
Yes that can be an issue with a high WAL output depending on the
interpreter of the archive command :(

Granted, any added delay from this patch is unlikely to be noticeable
unless your archiver is way behind and archive_status has a huge
number of files. However, I have seen cases where startup is stuck on
other tasks like SyncDataDirectory() and RemovePgTempFiles() for a
very long time, so perhaps it is worth considering.

What's the scale of the pg_wal partition and the amount of time things
were stuck? I would imagine that the sync phase hurts the most, and a
fast startup time for crash recovery is always important.

While archive_mode is tuned on, .ready files are created for all
existing wal files if not exists. Thus archiver may wait for the
earliest segment to have .ready file.

Yes, RemoveOldXlogFiles() does that via XLogArchiveCheckDone().

As the result
pgarch_readyXLog can be modified to loops over WAL files, not
status files. This prevents the confusion comes from .ready
files for non-existent segment files.

No, pgarch_readyXLog() should still look after .ready files as those are
here for this purpose, but we could have an additional check to see if
the segment linked with it actually exists and can be archived. This
check could happen in pgarch.c code before calling the archive command
gets called (just before pgarch_ArchiverCopyLoop and after
XLogArchiveCommandSet feels rather right, and that it should be cheap
enough to call stat()).
--
Michael

#13Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#12)
1 attachment(s)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Thu, Nov 22, 2018 at 01:16:09PM +0900, Michael Paquier wrote:

No, pgarch_readyXLog() should still look after .ready files as those are
here for this purpose, but we could have an additional check to see if
the segment linked with it actually exists and can be archived. This
check could happen in pgarch.c code before calling the archive command
gets called (just before pgarch_ArchiverCopyLoop and after
XLogArchiveCommandSet feels rather right, and that it should be cheap
enough to call stat()).

s/pgarch_ArchiverCopyLoop/pgarch_archiveXlog/.

Attached is a patch showing shaped based on the idea of upthread.
Thoughts?
--
Michael

Attachments:

archive-missing-v2.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 844b9d1b0e..1cc788ca83 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -28,6 +28,7 @@
 #include <fcntl.h>
 #include <signal.h>
 #include <time.h>
+#include <sys/stat.h>
 #include <sys/time.h>
 #include <sys/wait.h>
 #include <unistd.h>
@@ -427,6 +428,9 @@ pgarch_ArchiverCopyLoop(void)
 
 		for (;;)
 		{
+			struct stat	stat_buf;
+			char		pathname[MAXPGPATH];
+
 			/*
 			 * Do not initiate any more archive commands after receiving
 			 * SIGTERM, nor after the postmaster has died unexpectedly. The
@@ -456,6 +460,25 @@ pgarch_ArchiverCopyLoop(void)
 				return;
 			}
 
+			/*
+			 * In the event of a system crash, archive status files may be
+			 * left behind as their removals are not durable, cleaning up
+			 * orphan entries here is the cheapest method.  So check that
+			 * the segment trying to be archived still exists.
+			 */
+			snprintf(pathname, MAXPGPATH, XLOGDIR "/%s", xlog);
+			if (stat(pathname, &stat_buf) != 0)
+			{
+				char		xlogready[MAXPGPATH];
+
+				StatusFilePath(xlogready, xlog, ".ready");
+				if (durable_unlink(xlogready, WARNING) == 0)
+					ereport(WARNING,
+							(errmsg("removed orphan archive status file %s",
+									xlogready)));
+				return;
+			}
+
 			if (pgarch_archiveXlog(xlog))
 			{
 				/* successful */
#14Bossart, Nathan
bossartn@amazon.com
In reply to: Michael Paquier (#13)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On 11/21/18, 10:16 PM, "Michael Paquier" <michael@paquier.xyz> wrote:

At Fri, 02 Nov 2018 14:47:08 +0000, Nathan Bossart
<bossartn@amazon.com> wrote in
<154117002849.5569.14588306221618961668.pgcf@coridan.postgresql.org>:

Granted, any added delay from this patch is unlikely to be noticeable
unless your archiver is way behind and archive_status has a huge
number of files. However, I have seen cases where startup is stuck on
other tasks like SyncDataDirectory() and RemovePgTempFiles() for a
very long time, so perhaps it is worth considering.

What's the scale of the pg_wal partition and the amount of time things
were stuck? I would imagine that the sync phase hurts the most, and a
fast startup time for crash recovery is always important.

I don't have exact figures to share, but yes, a huge number of calls
to sync_file_range() and fsync() can use up a lot of time. Presumably
Postgres processes files individually instead of using sync() because
sync() may return before writing is done. Also, sync() would affect
non-Postgres files. However, it looks like Linux actually does wait
for writing to complete before returning from sync() [0]http://man7.org/linux/man-pages/man2/sync.2.html.

For RemovePgTempFiles(), the documentation above the function
indicates that skipping temp file cleanup during startup would
actually be okay because collisions with existing temp file names
should be handled by OpenTemporaryFile(). I assume this cleanup is
done during startup because there isn't a great alternative besides
offloading the work to a new background worker or something.

On 11/27/18, 6:35 AM, "Michael Paquier" <michael@paquier.xyz> wrote:

Attached is a patch showing shaped based on the idea of upthread.
Thoughts?

I took a look at this patch.

+                       /*
+                        * In the event of a system crash, archive status files may be
+                        * left behind as their removals are not durable, cleaning up
+                        * orphan entries here is the cheapest method.  So check that
+                        * the segment trying to be archived still exists.
+                        */
+                       snprintf(pathname, MAXPGPATH, XLOGDIR "/%s", xlog);
+                       if (stat(pathname, &stat_buf) != 0)
+                       {

Don't we also need to check that errno is ENOENT here?

+                               StatusFilePath(xlogready, xlog, ".ready");
+                               if (durable_unlink(xlogready, WARNING) == 0)
+                                       ereport(WARNING,
+                                                       (errmsg("removed orphan archive status file %s",
+                                                                       xlogready)));
+                               return;

IIUC any time that the file does not exist, we will attempt to unlink
it. Regardless of whether unlinking fails or succeeds, we then
proceed to give up archiving for now, but it's not clear why. Perhaps
we should retry unlinking a number of times (like we do for
pgarch_archiveXlog()) when durable_unlink() fails and simply "break"
to move on to the next .ready file if durable_unlink() succeeds.

Nathan

[0]: http://man7.org/linux/man-pages/man2/sync.2.html

#15Andres Freund
andres@anarazel.de
In reply to: Bossart, Nathan (#14)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

Hi,

On 2018-11-27 20:43:06 +0000, Bossart, Nathan wrote:

I don't have exact figures to share, but yes, a huge number of calls
to sync_file_range() and fsync() can use up a lot of time. Presumably
Postgres processes files individually instead of using sync() because
sync() may return before writing is done. Also, sync() would affect
non-Postgres files. However, it looks like Linux actually does wait
for writing to complete before returning from sync() [0].

sync() has absolutely no way to report errors. So, we're never going to
be able to use it. Besides, even postgres' temp files would be a good
reason to not use it.

Greetings,

Andres Freund

#16Bossart, Nathan
bossartn@amazon.com
In reply to: Andres Freund (#15)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On 11/27/18, 2:46 PM, "Andres Freund" <andres@anarazel.de> wrote:

On 2018-11-27 20:43:06 +0000, Bossart, Nathan wrote:

I don't have exact figures to share, but yes, a huge number of calls
to sync_file_range() and fsync() can use up a lot of time. Presumably
Postgres processes files individually instead of using sync() because
sync() may return before writing is done. Also, sync() would affect
non-Postgres files. However, it looks like Linux actually does wait
for writing to complete before returning from sync() [0].

sync() has absolutely no way to report errors. So, we're never going to
be able to use it. Besides, even postgres' temp files would be a good
reason to not use it.

Ah, I see. Thanks for clarifying.

Nathan

#17Michael Paquier
michael@paquier.xyz
In reply to: Bossart, Nathan (#14)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Tue, Nov 27, 2018 at 08:43:06PM +0000, Bossart, Nathan wrote:

Don't we also need to check that errno is ENOENT here?

Yep.

IIUC any time that the file does not exist, we will attempt to unlink
it. Regardless of whether unlinking fails or succeeds, we then
proceed to give up archiving for now, but it's not clear why. Perhaps
we should retry unlinking a number of times (like we do for
pgarch_archiveXlog()) when durable_unlink() fails and simply "break"
to move on to the next .ready file if durable_unlink() succeeds.

Both suggestions sound reasonable to me. (durable_unlink is not called
on HEAD in pgarch_archiveXlog). How about 3 retries with a in-between
wait of 1s? That's consistent with what pgarch_ArchiverCopyLoop does,
still I am not completely sure if we actually want to be consistent for
the purpose of removing orphaned ready files.
--
Michael

#18Bossart, Nathan
bossartn@amazon.com
In reply to: Michael Paquier (#17)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On 11/27/18, 3:20 PM, "Michael Paquier" <michael@paquier.xyz> wrote:

On Tue, Nov 27, 2018 at 08:43:06PM +0000, Bossart, Nathan wrote:

IIUC any time that the file does not exist, we will attempt to unlink
it. Regardless of whether unlinking fails or succeeds, we then
proceed to give up archiving for now, but it's not clear why. Perhaps
we should retry unlinking a number of times (like we do for
pgarch_archiveXlog()) when durable_unlink() fails and simply "break"
to move on to the next .ready file if durable_unlink() succeeds.

Both suggestions sound reasonable to me. (durable_unlink is not called
on HEAD in pgarch_archiveXlog). How about 3 retries with a in-between
wait of 1s? That's consistent with what pgarch_ArchiverCopyLoop does,
still I am not completely sure if we actually want to be consistent for
the purpose of removing orphaned ready files.

That sounds good to me. I was actually thinking of using the same
retry counter that we use for pgarch_archiveXlog(), but on second
thought, it is probably better to have two independent retry counters
for these two unrelated operations.

Nathan

#19Michael Paquier
michael@paquier.xyz
In reply to: Bossart, Nathan (#18)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Tue, Nov 27, 2018 at 09:49:29PM +0000, Bossart, Nathan wrote:

That sounds good to me. I was actually thinking of using the same
retry counter that we use for pgarch_archiveXlog(), but on second
thought, it is probably better to have two independent retry counters
for these two unrelated operations.

What I had in mind was two different variables if what I wrote was
unclear, possibly with the same value, as archiving failure and failure
with orphan file removals are two different concepts.
--
Michael

#20Bossart, Nathan
bossartn@amazon.com
In reply to: Michael Paquier (#19)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On 11/27/18, 3:53 PM, "Michael Paquier" <michael@paquier.xyz> wrote:

On Tue, Nov 27, 2018 at 09:49:29PM +0000, Bossart, Nathan wrote:

That sounds good to me. I was actually thinking of using the same
retry counter that we use for pgarch_archiveXlog(), but on second
thought, it is probably better to have two independent retry counters
for these two unrelated operations.

What I had in mind was two different variables if what I wrote was
unclear, possibly with the same value, as archiving failure and failure
with orphan file removals are two different concepts.

+1

Nathan

#21Michael Paquier
michael@paquier.xyz
In reply to: Bossart, Nathan (#20)
1 attachment(s)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Thu, Nov 29, 2018 at 03:00:42PM +0000, Bossart, Nathan wrote:

+1

Okay, here is an updated patch for this stuff, which does the following:
- Check for a WAL segment if it has a ".ready" status file, an orphaned
status file is removed only on ENOENT.
- If durable_unlink fails, retry 3 times. If there are too many
failures, the archiver gives up on the orphan status file removal. If
the removal works correctly, the archiver moves on to the next file.

(The variable names could be better.)
--
Michael

Attachments:

archive-missing-v3.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 844b9d1b0e..0a78003172 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -28,6 +28,7 @@
 #include <fcntl.h>
 #include <signal.h>
 #include <time.h>
+#include <sys/stat.h>
 #include <sys/time.h>
 #include <sys/wait.h>
 #include <unistd.h>
@@ -60,6 +61,7 @@
 									 * failed archiver; in seconds. */
 
 #define NUM_ARCHIVE_RETRIES 3
+#define NUM_STATUS_CLEANUP_RETRIES 3
 
 
 /* ----------
@@ -424,9 +426,13 @@ pgarch_ArchiverCopyLoop(void)
 	while (pgarch_readyXlog(xlog))
 	{
 		int			failures = 0;
+		int			failures_unlink = 0;
 
 		for (;;)
 		{
+			struct stat	stat_buf;
+			char		pathname[MAXPGPATH];
+
 			/*
 			 * Do not initiate any more archive commands after receiving
 			 * SIGTERM, nor after the postmaster has died unexpectedly. The
@@ -456,6 +462,44 @@ pgarch_ArchiverCopyLoop(void)
 				return;
 			}
 
+			/*
+			 * In the event of a system crash, archive status files may be
+			 * left behind as their removals are not durable, cleaning up
+			 * orphan entries here is the cheapest method.  So check that
+			 * the segment trying to be archived still exists.  If it does
+			 * not, its orphan status file is removed.
+			 */
+			snprintf(pathname, MAXPGPATH, XLOGDIR "/%s", xlog);
+			if (stat(pathname, &stat_buf) != 0 && errno == ENOENT)
+			{
+				char		xlogready[MAXPGPATH];
+
+				StatusFilePath(xlogready, xlog, ".ready");
+				if (durable_unlink(xlogready, WARNING) == 0)
+				{
+					ereport(WARNING,
+							(errmsg("removed orphan archive status file %s",
+									xlogready)));
+
+					/* leave loop and move to the next status file */
+					break;
+				}
+
+				if (++failures_unlink >= NUM_STATUS_CLEANUP_RETRIES)
+				{
+					ereport(WARNING,
+							(errmsg("failed removal of \"%s\" too many times, will try again later",
+									xlogready)));
+
+					/* give up cleanup of orphan status files */
+					return;
+				}
+
+				/* wait a bit before retrying */
+				pg_usleep(1000000L);
+				continue;
+			}
+
 			if (pgarch_archiveXlog(xlog))
 			{
 				/* successful */
#22Bossart, Nathan
bossartn@amazon.com
In reply to: Michael Paquier (#21)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On 12/4/18, 1:36 AM, "Michael Paquier" <michael@paquier.xyz> wrote:

Okay, here is an updated patch for this stuff, which does the following:
- Check for a WAL segment if it has a ".ready" status file, an orphaned
status file is removed only on ENOENT.
- If durable_unlink fails, retry 3 times. If there are too many
failures, the archiver gives up on the orphan status file removal. If
the removal works correctly, the archiver moves on to the next file.

Thanks for the updated patch! The code looks good to me, the patch
applies cleanly and builds without warnings, and it seems to work well
in my manual tests. I just have a few wording suggestions.

+			/*
+			 * In the event of a system crash, archive status files may be
+			 * left behind as their removals are not durable, cleaning up
+			 * orphan entries here is the cheapest method.  So check that
+			 * the segment trying to be archived still exists.  If it does
+			 * not, its orphan status file is removed.
+			 */

I would phrase this comment this way:

Since archive_status files are not durably removed, a system
crash could leave behind .ready files for WAL segments that
have already been recycled or removed. In this case, simply
remove the orphan status file and move on.

+					ereport(WARNING,
+							(errmsg("removed orphan archive status file %s",
+									xlogready)));

I think we should put quotes around the file name like we do elsewhere
in pgarch_ArchiverCopyLoop().

+					ereport(WARNING,
+							(errmsg("failed removal of \"%s\" too many times, will try again later",
+									xlogready)));

I'd suggest mirroring the log statement for failed archiving commands
and saying something like, "removing orphan archive status file \"%s\"
failed too many times, will try again later." IMO that makes it
clearer what is failing and why we are removing it in the first place.

Nathan

#23Michael Paquier
michael@paquier.xyz
In reply to: Bossart, Nathan (#22)
1 attachment(s)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Tue, Dec 04, 2018 at 08:40:40PM +0000, Bossart, Nathan wrote:

Thanks for the updated patch! The code looks good to me, the patch
applies cleanly and builds without warnings, and it seems to work well
in my manual tests. I just have a few wording suggestions.

How are you testing this? I just stop the server and manually touch
some fake status files in archive_status :)

I would phrase this comment this way:

Since archive_status files are not durably removed, a system
crash could leave behind .ready files for WAL segments that
have already been recycled or removed. In this case, simply
remove the orphan status file and move on.

Fine for me. Thanks!

+     ereport(WARNING,
+             (errmsg("removed orphan archive status file %s",
+                     xlogready)));

I think we should put quotes around the file name like we do elsewhere
in pgarch_ArchiverCopyLoop().

Done.

+					ereport(WARNING,
+							(errmsg("failed removal of \"%s\" too many times, will try again later",
+									xlogready)));

I'd suggest mirroring the log statement for failed archiving commands
and saying something like, "removing orphan archive status file \"%s\"
failed too many times, will try again later." IMO that makes it
clearer what is failing and why we are removing it in the first place.

"removal of" is more consistent here I think, so changed this way with
your wording merged.
--
Michael

Attachments:

archive-missing-v4.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 844b9d1b0e..90817c8598 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -28,6 +28,7 @@
 #include <fcntl.h>
 #include <signal.h>
 #include <time.h>
+#include <sys/stat.h>
 #include <sys/time.h>
 #include <sys/wait.h>
 #include <unistd.h>
@@ -60,6 +61,7 @@
 									 * failed archiver; in seconds. */
 
 #define NUM_ARCHIVE_RETRIES 3
+#define NUM_STATUS_CLEANUP_RETRIES 3
 
 
 /* ----------
@@ -424,9 +426,13 @@ pgarch_ArchiverCopyLoop(void)
 	while (pgarch_readyXlog(xlog))
 	{
 		int			failures = 0;
+		int			failures_unlink = 0;
 
 		for (;;)
 		{
+			struct stat	stat_buf;
+			char		pathname[MAXPGPATH];
+
 			/*
 			 * Do not initiate any more archive commands after receiving
 			 * SIGTERM, nor after the postmaster has died unexpectedly. The
@@ -456,6 +462,43 @@ pgarch_ArchiverCopyLoop(void)
 				return;
 			}
 
+			/*
+			 * Since archive status files are not durably removed, a system
+			 * crash could leave behind .ready files for WAL segments that
+			 * have already been recycled or removed.  In this case, simply
+			 * remove the orphan status file and move on.
+			 */
+			snprintf(pathname, MAXPGPATH, XLOGDIR "/%s", xlog);
+			if (stat(pathname, &stat_buf) != 0 && errno == ENOENT)
+			{
+				char		xlogready[MAXPGPATH];
+
+				StatusFilePath(xlogready, xlog, ".ready");
+				if (durable_unlink(xlogready, WARNING) == 0)
+				{
+					ereport(WARNING,
+							(errmsg("removed orphan archive status file \"%s\"",
+									xlogready)));
+
+					/* leave loop and move to the next status file */
+					break;
+				}
+
+				if (++failures_unlink >= NUM_STATUS_CLEANUP_RETRIES)
+				{
+					ereport(WARNING,
+							(errmsg("removal of orphan archive status file \"%s\" failed too many times, will try again later",
+									xlogready)));
+
+					/* give up cleanup of orphan status files */
+					return;
+				}
+
+				/* wait a bit before retrying */
+				pg_usleep(1000000L);
+				continue;
+			}
+
 			if (pgarch_archiveXlog(xlog))
 			{
 				/* successful */
#24Bossart, Nathan
bossartn@amazon.com
In reply to: Michael Paquier (#23)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On 12/4/18, 7:35 PM, "Michael Paquier" <michael@paquier.xyz> wrote:

On Tue, Dec 04, 2018 at 08:40:40PM +0000, Bossart, Nathan wrote:

Thanks for the updated patch! The code looks good to me, the patch
applies cleanly and builds without warnings, and it seems to work well
in my manual tests. I just have a few wording suggestions.

How are you testing this? I just stop the server and manually touch
some fake status files in archive_status :)

That's almost exactly what I was doing, too.

I would phrase this comment this way:

Since archive_status files are not durably removed, a system
crash could leave behind .ready files for WAL segments that
have already been recycled or removed. In this case, simply
remove the orphan status file and move on.

Fine for me. Thanks!

+     ereport(WARNING,
+             (errmsg("removed orphan archive status file %s",
+                     xlogready)));

I think we should put quotes around the file name like we do elsewhere
in pgarch_ArchiverCopyLoop().

Done.

+					ereport(WARNING,
+							(errmsg("failed removal of \"%s\" too many times, will try again later",
+									xlogready)));

I'd suggest mirroring the log statement for failed archiving commands
and saying something like, "removing orphan archive status file \"%s\"
failed too many times, will try again later." IMO that makes it
clearer what is failing and why we are removing it in the first place.

"removal of" is more consistent here I think, so changed this way with
your wording merged.

The v4 patch looks good to me!

Nathan

#25Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Bossart, Nathan (#24)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

Hello.

At Wed, 5 Dec 2018 16:11:23 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in <DB47EBD7-7291-4C39-9F8F-BE42E5193BD5@amazon.com>

The v4 patch looks good to me!

durable_unlnk has two modes of faiure. Failure to unlink and
fsync. If once it fails at the fsync stage, subsequent
durable_unlink calls for the same file always fail to unlink with
ENOENT. If durable_unlink is intended to be called repeatedly on
falure, perhaps it should return a different code for each
failure so that the caller can indentify what to do next.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#26Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro HORIGUCHI (#25)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Thu, Dec 06, 2018 at 01:55:46PM +0900, Kyotaro HORIGUCHI wrote:

durable_unlink has two modes of faiure. Failure to unlink and
fsync. If once it fails at the fsync stage, subsequent
durable_unlink calls for the same file always fail to unlink with
ENOENT. If durable_unlink is intended to be called repeatedly on
falure, perhaps it should return a different code for each
failure so that the caller can indentify what to do next.

Why? A WARNING would be logged if the first unlink() fails, and
another, different WARNING would be logged if the subsequent fsync
fails. It looks enough to me to make a distinction between both. Now,
you may have a point in the fact that we could also live with only using
unlink() for this code path, as even on repetitive crashes this would
take care of removing orphan archive status files consistently.
--
Michael

#27Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#26)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Thu, Dec 06, 2018 at 02:43:35PM +0900, Michael Paquier wrote:

Why? A WARNING would be logged if the first unlink() fails, and
another, different WARNING would be logged if the subsequent fsync
fails. It looks enough to me to make a distinction between both. Now,
you may have a point in the fact that we could also live with only using
unlink() for this code path, as even on repetitive crashes this would
take care of removing orphan archive status files consistently.

After sleeping on that, using plain unlink() makes indeed the most
sense. Any objections if I move on with that, adding a proper comment
explaining the choice? I don't plan to finish wrapping this patch today
but Monday my time anyway.
--
Michael

#28Bossart, Nathan
bossartn@amazon.com
In reply to: Michael Paquier (#27)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On 12/6/18, 4:54 PM, "Michael Paquier" <michael@paquier.xyz> wrote:

On Thu, Dec 06, 2018 at 02:43:35PM +0900, Michael Paquier wrote:

Why? A WARNING would be logged if the first unlink() fails, and
another, different WARNING would be logged if the subsequent fsync
fails. It looks enough to me to make a distinction between both. Now,
you may have a point in the fact that we could also live with only using
unlink() for this code path, as even on repetitive crashes this would
take care of removing orphan archive status files consistently.

After sleeping on that, using plain unlink() makes indeed the most
sense. Any objections if I move on with that, adding a proper comment
explaining the choice? I don't plan to finish wrapping this patch today
but Monday my time anyway.

That seems reasonable to me.

Nathan

#29Michael Paquier
michael@paquier.xyz
In reply to: Bossart, Nathan (#28)
Re: Use durable_unlink for .ready and .done files for WAL segment removal

On Thu, Dec 06, 2018 at 11:18:12PM +0000, Bossart, Nathan wrote:

That seems reasonable to me.

Thanks, committed after renaming a couple of variables, after adding a
comment about the use of unlink() and also adding a comment explaining
what the different retry counters are here for.
--
Michael