Temporary WAL segments files not cleaned up after an instance crash

Started by Michael Paquierover 7 years ago8 messages
#1Michael Paquier
michael@paquier.xyz
1 attachment(s)

Hi all,

While playing with a standby as follows I noticed that xlogtemp.*
generated in pg_wal may stay around when entering crash recovery. The
test I was conducting is pretty simple:
- Use a primary and a standby.
- Run pgbench on the primary.
- Then restart the standby with -m immediate and force WAL segment
switch on the primary in a loop. Depending on the timing, one can see
that those xlogtemp files stay around. Those files are here when
creating a new segment from scratch and append the PID of the process
creating them. Any previous file existing with the same name is
unlinked.

The problem is that if an instance is not really stable for a reason or
another and starts crash recovery periodically, then there is a risk of
accumulating those temporary files. If pg_wal is on its own partition,
tuned by max_wal_size, then there is a risk to run into ENOSPC and take
PostgreSQL down as new WAL segments cannot be created.

Shouldn't those files be cleaned up at the beginning of crash recovery?
Attached is a proposal of patch doing so.

Thoughts?
--
Michael

Attachments:

remove-xlogtemp-crash.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c633e11128..60f9f75aa9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -881,6 +881,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
+static void RemoveXLogTempFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
 static void RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
 static void UpdateLastRemovedPtr(char *filename);
@@ -3851,6 +3852,35 @@ UpdateLastRemovedPtr(char *filename)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Remove all temporary log files in pg_wal
+ *
+ * This is called at the beginning of recovery after a previous crash,
+ * at a point where no other processes write fresh WAL data.
+ */
+static void
+RemoveXLogTempFiles(void)
+{
+	DIR	   *xldir;
+	struct dirent *xlde;
+
+	elog(DEBUG2, "removing all temporary WAL files");
+
+	xldir = AllocateDir(XLOGDIR);
+	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+	{
+		char		path[MAXPGPATH];
+
+		if (strncmp(xlde->d_name, "xlogtemp.", 9) != 0)
+			continue;
+
+		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
+		elog(DEBUG2, "removed temporary WAL file \"%s\"", path);
+		unlink(path);
+	}
+	FreeDir(xldir);
+}
+
 /*
  * Recycle or remove all log files older or equal to passed segno.
  *
@@ -6352,16 +6382,24 @@ StartupXLOG(void)
 	ValidateXLOGDirectoryStructure();
 
 	/*
-	 * If we previously crashed, there might be data which we had written,
-	 * intending to fsync it, but which we had not actually fsync'd yet.
-	 * Therefore, a power failure in the near future might cause earlier
-	 * unflushed writes to be lost, even though more recent data written to
-	 * disk from here on would be persisted.  To avoid that, fsync the entire
+	 * If we previously crashed, perform a couple of actions:
+	 * 1) The pg_wal directory may still include some temporary WAL
+	 * segments used when creating a new segment, so perform some
+	 * clean up to not bloat this path.  This is done first as there
+	 * is no point to sync this temporary data.
+	 * 2) There might be data which we had written, intending to fsync
+	 * it, but which we had not actually fsync'd yet. Therefore, a
+	 * power failure in the near future might cause earlier unflushed
+	 * writes to be lost, even though more recent data written to disk
+	 * from here on would be persisted.  To avoid that, fsync the entire
 	 * data directory.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
 		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+	{
+		RemoveXLogTempFiles();
 		SyncDataDirectory();
+	}
 
 	/*
 	 * Initialize on the assumption we want to recover to the latest timeline
#2Yugo Nagata
nagata@sraoss.co.jp
In reply to: Michael Paquier (#1)
Re: Temporary WAL segments files not cleaned up after an instance crash

On Mon, 14 May 2018 14:49:55 +0900
Michael Paquier <michael@paquier.xyz> wrote:

Hi all,

While playing with a standby as follows I noticed that xlogtemp.*
generated in pg_wal may stay around when entering crash recovery. The
test I was conducting is pretty simple:
- Use a primary and a standby.
- Run pgbench on the primary.
- Then restart the standby with -m immediate and force WAL segment
switch on the primary in a loop. Depending on the timing, one can see
that those xlogtemp files stay around. Those files are here when
creating a new segment from scratch and append the PID of the process
creating them. Any previous file existing with the same name is
unlinked.

The problem is that if an instance is not really stable for a reason or
another and starts crash recovery periodically, then there is a risk of
accumulating those temporary files. If pg_wal is on its own partition,
tuned by max_wal_size, then there is a risk to run into ENOSPC and take
PostgreSQL down as new WAL segments cannot be created.

Shouldn't those files be cleaned up at the beginning of crash recovery?
Attached is a proposal of patch doing so.

I think it makes sense to remove unnecessary temporary WAL files although
I'm not sure how high the risk of ENOSPC is.

The code looks fine, the patch can be applied to HEAD, and I can build this
successfully. I confirmed that all tempxlog.* files are removed when restarting
postgres that was shutdown immediately.

One little thing I noticed is the function name "RemoveXLogTempFiles".
Other similar functions are named as RemoveOldXlogFiles or RemoveXlogFile
(using Xlog not XLog), so it seem to me more consistent to rename this
"RemoveXlogTempFiles" or "RemoveTempXlogFiles" and so on.

Regards

--
Yugo Nagata <nagata@sraoss.co.jp>

#3Michael Paquier
michael@paquier.xyz
In reply to: Yugo Nagata (#2)
1 attachment(s)
Re: Temporary WAL segments files not cleaned up after an instance crash

On Thu, Jul 12, 2018 at 03:35:53PM +0900, Yugo Nagata wrote:

I think it makes sense to remove unnecessary temporary WAL files although
I'm not sure how high the risk of ENOSPC is.

It depends on how close to the partition size limit max_wal_size is set,
and how much a system is unstable. Switching on/off a VM where Postgres
is located can participate in that, as well as VM snapshots taken
without memory (I work a lot on those as you can guess :D). Setting it
to 70% of the partition size is what I imagine is the base, but I can
imagine as well people setting it at 90% or more.

Still the probability is low, which is why I think that it would make
sense to just fix the problem on HEAD and move on.

One little thing I noticed is the function name "RemoveXLogTempFiles".
Other similar functions are named as RemoveOldXlogFiles or RemoveXlogFile
(using Xlog not XLog), so it seem to me more consistent to rename this
"RemoveXlogTempFiles" or "RemoveTempXlogFiles" and so on.

I see, a lower-case for Xlog instead of XLog. That makes sense. I have
used your second suggestion in the attached. I have also changed the
thing so as the format of the comment block is better even after
indenting.

Thoughts?
--
Michael

Attachments:

remove-xlogtemp-crash-v2.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 022dac6cd7..94526e8942 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -887,6 +887,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
+static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
 static void RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
 static void UpdateLastRemovedPtr(char *filename);
@@ -3863,6 +3864,35 @@ UpdateLastRemovedPtr(char *filename)
 	SpinLockRelease(&XLogCtl->info_lck);
 }
 
+/*
+ * Remove all temporary log files in pg_wal
+ *
+ * This is called at the beginning of recovery after a previous crash,
+ * at a point where no other processes write fresh WAL data.
+ */
+static void
+RemoveTempXlogFiles(void)
+{
+	DIR		   *xldir;
+	struct dirent *xlde;
+
+	elog(DEBUG2, "removing all temporary WAL files");
+
+	xldir = AllocateDir(XLOGDIR);
+	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+	{
+		char		path[MAXPGPATH];
+
+		if (strncmp(xlde->d_name, "xlogtemp.", 9) != 0)
+			continue;
+
+		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
+		elog(DEBUG2, "removed temporary WAL file \"%s\"", path);
+		unlink(path);
+	}
+	FreeDir(xldir);
+}
+
 /*
  * Recycle or remove all log files older or equal to passed segno.
  *
@@ -6379,17 +6409,25 @@ StartupXLOG(void)
 	 */
 	ValidateXLOGDirectoryStructure();
 
-	/*
-	 * If we previously crashed, there might be data which we had written,
-	 * intending to fsync it, but which we had not actually fsync'd yet.
-	 * Therefore, a power failure in the near future might cause earlier
-	 * unflushed writes to be lost, even though more recent data written to
-	 * disk from here on would be persisted.  To avoid that, fsync the entire
-	 * data directory.
+	/*----------
+	 * If we previously crashed, perform a couple of actions:
+	 *	- The pg_wal directory may still include some temporary WAL segments
+	 * used when creating a new segment, so perform some clean up to not
+	 * bloat this path.  This is done first as there is no point to sync this
+	 * temporary data.
+	 *	- There might be data which we had written, intending to fsync it,
+	 * but which we had not actually fsync'd yet. Therefore, a power failure
+	 * in the near future might cause earlier unflushed writes to be lost,
+	 * even though more recent data written to disk from here on would be
+	 * persisted.  To avoid that, fsync the entire data directory.
+	 *---------
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
 		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+	{
+		RemoveTempXlogFiles();
 		SyncDataDirectory();
+	}
 
 	/*
 	 * Initialize on the assumption we want to recover to the latest timeline
#4Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#3)
Re: Temporary WAL segments files not cleaned up after an instance crash

On 12/07/18 10:44, Michael Paquier wrote:

On Thu, Jul 12, 2018 at 03:35:53PM +0900, Yugo Nagata wrote:

One little thing I noticed is the function name "RemoveXLogTempFiles".
Other similar functions are named as RemoveOldXlogFiles or RemoveXlogFile
(using Xlog not XLog), so it seem to me more consistent to rename this
"RemoveXlogTempFiles" or "RemoveTempXlogFiles" and so on.

I see, a lower-case for Xlog instead of XLog. That makes sense. I have
used your second suggestion in the attached. I have also changed the
thing so as the format of the comment block is better even after
indenting.

Looks good to me. Just one little nitpick:

+		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
+		elog(DEBUG2, "removed temporary WAL file \"%s\"", path);
+		unlink(path);

The elog message says "removed", but the removal actually happens after
the elog. "removing" would be more accurate.

- Heikki

#5Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#4)
Re: Temporary WAL segments files not cleaned up after an instance crash

On Thu, Jul 12, 2018 at 01:15:03PM +0300, Heikki Linnakangas wrote:

On 12/07/18 10:44, Michael Paquier wrote:

+		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
+		elog(DEBUG2, "removed temporary WAL file \"%s\"", path);
+		unlink(path);

The elog message says "removed", but the removal actually happens after the
elog. "removing" would be more accurate.

Or just move the elog() after the file is actually removed? Would you
be fine with that?
--
Michael

#6Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#5)
Re: Temporary WAL segments files not cleaned up after an instance crash

On 12/07/18 15:38, Michael Paquier wrote:

On Thu, Jul 12, 2018 at 01:15:03PM +0300, Heikki Linnakangas wrote:

On 12/07/18 10:44, Michael Paquier wrote:

+		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
+		elog(DEBUG2, "removed temporary WAL file \"%s\"", path);
+		unlink(path);

The elog message says "removed", but the removal actually happens after the
elog. "removing" would be more accurate.

Or just move the elog() after the file is actually removed? Would you
be fine with that?

Sure.

- Heikki

#7Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#6)
Re: Temporary WAL segments files not cleaned up after an instance crash

On Thu, Jul 12, 2018 at 03:40:43PM +0300, Heikki Linnakangas wrote:

Sure.

Thanks for the reviews, I have pushed the patch after moving the elog()
call and changing the logs to mention "WAL segments" instead of "WAL
files".
--
Michael

#8Yugo Nagata
nagata@sraoss.co.jp
In reply to: Michael Paquier (#3)
Re: Temporary WAL segments files not cleaned up after an instance crash

On Thu, 12 Jul 2018 16:44:45 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Jul 12, 2018 at 03:35:53PM +0900, Yugo Nagata wrote:

I think it makes sense to remove unnecessary temporary WAL files although
I'm not sure how high the risk of ENOSPC is.

It depends on how close to the partition size limit max_wal_size is set,
and how much a system is unstable. Switching on/off a VM where Postgres
is located can participate in that, as well as VM snapshots taken
without memory (I work a lot on those as you can guess :D). Setting it
to 70% of the partition size is what I imagine is the base, but I can
imagine as well people setting it at 90% or more.

Thank you for your explaining this. I have understood the problem
you concern well.

Thanks,

--
Yugo Nagata <nagata@sraoss.co.jp>