Speed up the removal of WAL files
Hello,
The attached patch speeds up the removal of WAL files in the old timelines. I'll add this to the next CF.
BACKGROUND
==================================================
We need to meet a severe availability requirement of a potential customer. They will use synchronous streaming replication. The allowed failover duration, from the failure through failure detection to the failover completion, is 10 seconds. Even one second is precious.
During a testing on a fast machine with SSD, we observed about 2 seconds between these messages. There were no other messages between them.
LOG: archive recovery complete
LOG: MultiXact member wraparound protections are now enabled
CAUSE
==================================================
Examining the source code, RemoveNonParentXlogFiles() seems to account for the time. It syncs pg_wal directory every time it deletes a WAL file. max_wal_size was set to 48GB, so about 1,000 WAL files were probably deleted and hence the pg_wal directory was synced as much.
FIX
==================================================
unlink() the WAL files, then sync the pg_wal directory once at the end.
Unfortunately, the original machine is now not available, so I confirmed the speedup on a VM with HDD.
[time to remove 1,000 WAL files including the directory sync]
nonpatched: 2.45 seconds
patched: 0.81 seconds
Regards
Takayuki Tsunakawa
Attachments:
speedup_wal_removal.patchapplication/octet-stream; name=speedup_wal_removal.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e729180..685a90b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3909,6 +3909,7 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
struct dirent *xlde;
char switchseg[MAXFNAMELEN];
XLogSegNo endLogSegNo;
+ char path[MAXPGPATH];
XLByteToPrevSeg(switchpoint, endLogSegNo, wal_segment_size);
@@ -3948,9 +3949,14 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
* - but seems safer to let them be archived and removed later.
*/
if (!XLogArchiveIsReady(xlde->d_name))
- RemoveXlogFile(xlde->d_name, InvalidXLogRecPtr, switchpoint);
+ {
+ snprintf(path, sizeof(path), XLOGDIR "/%s", xlde->d_name);
+ if (unlink(path) == 0)
+ XLogArchiveCleanup(xlde->d_name);
+ }
}
}
+ (void) fsync_fname(XLOGDIR, true);
FreeDir(xldir);
}
Hello,
At Fri, 17 Nov 2017 06:35:41 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1F81B0C8@G01JPEXMBYT05>
Hello,
The attached patch speeds up the removal of WAL files in the old timelines. I'll add this to the next CF.
BACKGROUND
==================================================We need to meet a severe availability requirement of a potential customer. They will use synchronous streaming replication. The allowed failover duration, from the failure through failure detection to the failover completion, is 10 seconds. Even one second is precious.
During a testing on a fast machine with SSD, we observed about 2 seconds between these messages. There were no other messages between them.
LOG: archive recovery complete
LOG: MultiXact member wraparound protections are now enabledCAUSE
==================================================Examining the source code, RemoveNonParentXlogFiles() seems to account for the time. It syncs pg_wal directory every time it deletes a WAL file. max_wal_size was set to 48GB, so about 1,000 WAL files were probably deleted and hence the pg_wal directory was synced as much.
FIX
==================================================unlink() the WAL files, then sync the pg_wal directory once at the end.
Unfortunately, the original machine is now not available, so I confirmed the speedup on a VM with HDD.
[time to remove 1,000 WAL files including the directory sync]
nonpatched: 2.45 seconds
patched: 0.81 secondsRegards
Takayuki Tsunakawa
The orinal code recycles some of the to-be-removed files, but the
patch removes all the victims. This may impact on performance.
Likewise the original code is using durable_unlink to actually
remove a file so separating unlink and fsync might resurrect the
problem that should have been fixed by
1b02be21f271db6bd3cd43abb23fa596fcb6bac3 (I'm not sure what it
was but you are one of the reviwers of it). I suppose that you
need to explain the reason why this change doesn't risk anything.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
The orinal code recycles some of the to-be-removed files, but the patch
removes all the victims. This may impact on performance.
Yes, I noticed it after submitting the patch and was wondering what to do. Thinking simply, I think it would be just enough to replace durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and sync the pg_wal directory once in RemoveNonParentXlogFiles() and RemoveOldXlogFiles(). This will benefit the failover time when fast promotion is not performed. What do you think?
BTW, RemoveNonParentXlogFiles() recycles only 10 WAL files and delete all other files. So the impact of modification is limited.
Likewise the original code is using durable_unlink to actually remove a
file so separating unlink and fsync might resurrect the problem that should
have been fixed by
1b02be21f271db6bd3cd43abb23fa596fcb6bac3 (I'm not sure what it was but you
are one of the reviwers of it). I suppose that you need to explain the reason
why this change doesn't risk anything.
It's safe because the directory is finally synced. If the database server crashes before it deletes all WAL files, it performs the deletion again during the next recovery.
Regards
Takayuki Tsunakawa
On Fri, Nov 17, 2017 at 5:20 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
The orinal code recycles some of the to-be-removed files, but the patch
removes all the victims. This may impact on performance.Yes, I noticed it after submitting the patch and was wondering what to do. Thinking simply, I think it would be just enough to replace durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and sync the pg_wal directory once in RemoveNonParentXlogFiles() and RemoveOldXlogFiles(). This will benefit the failover time when fast promotion is not performed. What do you think?
It seems not good idea to just replace durable_rename() with rename()
in RemoveOldXlogFiles()->RemoveXlogFiles()->InstallXLogFileSegment().
Because that change seems to be able to cause the following problem.
1. Checkpoint calls RemoveOldXlogFiles().
2. It recycles the WAL file AAA to BBB. pg_wal directory has not fsync'd yet.
3. Another transaction (TX1) writes its WAL data into the (recycled) file BBB.
4. CRASH and RESTART
5. The WAL file BBB disappears and you can see AAA,
but AAA is not used in recovery. This causes data loss of
transaction by Tx1.
Regards,
--
Fujii Masao
On Sat, Nov 18, 2017 at 2:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Nov 17, 2017 at 5:20 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
The orinal code recycles some of the to-be-removed files, but the patch
removes all the victims. This may impact on performance.Yes, I noticed it after submitting the patch and was wondering what to do. Thinking simply, I think it would be just enough to replace durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and sync the pg_wal directory once in RemoveNonParentXlogFiles() and RemoveOldXlogFiles(). This will benefit the failover time when fast promotion is not performed. What do you think?
Note that durable_rename() also flushes the origin file to make sure
that it does not show up again after a crash.
It seems not good idea to just replace durable_rename() with rename()
in RemoveOldXlogFiles()->RemoveXlogFiles()->InstallXLogFileSegment().
Because that change seems to be able to cause the following problem.
If archiving is enabled, RemoveOldXlogFiles() would create a .ready
flag for all segments that have reappeared. Oops, it archived again.
--
Michael
Thanks, it seems to require a bit more consideration about RemoveOldXLogFiles(). Let me continue this next month.
Regards
Takayuki Tsunakawa
Show quoted text
-----Original Message-----
From: Michael Paquier [mailto:michael.paquier@gmail.com]
Sent: Saturday, November 18, 2017 10:37 PM
To: Fujii Masao
Cc: Tsunakawa, Takayuki/綱川 貴之; Kyotaro HORIGUCHI;
pgsql-hackers@postgresql.org
Subject: Re: Speed up the removal of WAL filesOn Sat, Nov 18, 2017 at 2:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Nov 17, 2017 at 5:20 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
The orinal code recycles some of the to-be-removed files, but the
patch removes all the victims. This may impact on performance.Yes, I noticed it after submitting the patch and was wondering what to
do. Thinking simply, I think it would be just enough to replace
durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and
sync the pg_wal directory once in RemoveNonParentXlogFiles() and
RemoveOldXlogFiles(). This will benefit the failover time when fast
promotion is not performed. What do you think?Note that durable_rename() also flushes the origin file to make sure that
it does not show up again after a crash.It seems not good idea to just replace durable_rename() with rename()
in RemoveOldXlogFiles()->RemoveXlogFiles()->InstallXLogFileSegment().
Because that change seems to be able to cause the following problem.If archiving is enabled, RemoveOldXlogFiles() would create a .ready flag
for all segments that have reappeared. Oops, it archived again.
From: Fujii Masao [mailto:masao.fujii@gmail.com]
On Fri, Nov 17, 2017 at 5:20 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:Yes, I noticed it after submitting the patch and was wondering what to
do. Thinking simply, I think it would be just enough to replace
durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and
sync the pg_wal directory once in RemoveNonParentXlogFiles() and
RemoveOldXlogFiles(). This will benefit the failover time when fast
promotion is not performed. What do you think?It seems not good idea to just replace durable_rename() with rename() in
RemoveOldXlogFiles()->RemoveXlogFiles()->InstallXLogFileSegment().
Because that change seems to be able to cause the following problem.1. Checkpoint calls RemoveOldXlogFiles().
2. It recycles the WAL file AAA to BBB. pg_wal directory has not fsync'd
yet.
3. Another transaction (TX1) writes its WAL data into the (recycled) file
BBB.
4. CRASH and RESTART
5. The WAL file BBB disappears and you can see AAA,
but AAA is not used in recovery. This causes data loss of transaction
by Tx1.
Right. Then I thought of doing the following to avoid making a new function only for RemoveNonParentXlogFiles() which is similar to RemoveXlogFile().
* Add an argument "bool durable" to RemoveXlogFile(). Based on its value, RemoveXlogFile() calls either durable_xx() or xx().
* Likewise, add an argument "bool durable" to InstallXLogFileSegment() and do the same.
* RemoveNonParentXlogFiles() calls RemoveXlogFile() with durable=false. At the end of the function, sync the pg_wal directory with fsync_fname().
* RemoveOldXlogFiles() does the same thing. One difference is that it passes false to RemoveXlogFile() during recovery (InRecovery == true) and true otherwise.
But this requires InstallXLogFileSegment() to also have an argument "bool durable." That means InstallXLogFileSegment() and RemoveXlogFile() have to include something like below in several places, which is dirty:
if (durable)
rc = durable_xx(path, elevel);
else
rc = xx(path);
if (rc != 0)
{
/* durable_xx() output the message */
if (!durable)
ereport(elevel, ...);
}
Do you think of a cleaner way?
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
The orinal code recycles some of the to-be-removed files, but the
patch removes all the victims. This may impact on performance.
At most only 10 WAL files are recycled. If there's no good solution to the above matter, can we compromise with the current patch?
if (PriorRedoPtr == InvalidXLogRecPtr)
recycleSegNo = endlogSegNo + 10;
Likewise the original code is using durable_unlink to actually
remove a file so separating unlink and fsync might resurrect the
problem that should have been fixed by
1b02be21f271db6bd3cd43abb23fa596fcb6bac3 (I'm not sure what it
was but you are one of the reviwers of it). I suppose that you
need to explain the reason why this change doesn't risk anything.
If the host crashes while running RemoveNonParentXlogFiles() and some effects of the function are lost, the next recovery will do them again.
Regards
Takayuki Tsunakawa
On Wed, Feb 21, 2018 at 07:20:00AM +0000, Tsunakawa, Takayuki wrote:
Right. Then I thought of doing the following to avoid making a new
function only for RemoveNonParentXlogFiles() which is similar to
RemoveXlogFile().* Add an argument "bool durable" to RemoveXlogFile(). Based on its
value, RemoveXlogFile() calls either durable_xx() or xx().* Likewise, add an argument "bool durable" to InstallXLogFileSegment()
and do the same.* RemoveNonParentXlogFiles() calls RemoveXlogFile() with
durable=false. At the end of the function, sync the pg_wal directory
with fsync_fname().* RemoveOldXlogFiles() does the same thing. One difference is that it
passes false to RemoveXlogFile() during recovery (InRecovery == true)
and true otherwise.
It seems to me that you would reintroduce partially the problems that
1d4a0ab1 has fixed. In short, if a crash happens in the code paths
calling RemoveXlogFile with durable = false before fsync'ing pg_wal,
then a rename has no guarantee to be durable, so you could finish again
with a file that as an old name, but new contents. A crucial thing
which matters for a rename to be durable is that the old file is sync'ed
as well.
--
Michael
From: Michael Paquier [mailto:michael@paquier.xyz]
It seems to me that you would reintroduce partially the problems that
1d4a0ab1 has fixed. In short, if a crash happens in the code paths calling
RemoveXlogFile with durable = false before fsync'ing pg_wal, then a rename
has no guarantee to be durable, so you could finish again with a file that
as an old name, but new contents. A crucial thing which matters for a rename
Hmm, you're right. Even during recovery, RemoveOldXlogFiles() can't skip fsyncing pg_wal/ because new WAL records streamed from the master are written to recycled WAL files.
After all, it seems to me that we have to stand with the current patch which only handles RemoveNonParentXlogFiles().
Regards
Takayuki Tsunakawa
On Wed, Feb 21, 2018 at 5:27 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
From: Michael Paquier [mailto:michael@paquier.xyz]
It seems to me that you would reintroduce partially the problems that1d4a0ab1 has fixed. In short, if a crash happens in the code paths calling
RemoveXlogFile with durable = false before fsync'ing pg_wal, then a rename
has no guarantee to be durable, so you could finish again with a file that
as an old name, but new contents. A crucial thing which matters for a renameHmm, you're right. Even during recovery, RemoveOldXlogFiles() can't skip fsyncing pg_wal/ because new WAL records streamed from the master are written to recycled WAL files.
After all, it seems to me that we have to stand with the current patch which only handles RemoveNonParentXlogFiles().
But the approach that the patch uses would cause the performance problem
as Horiguchi-san pointed out upthread.
So, what about, as another approach, making the checkpointer instead of
the startup process call RemoveNonParentXlogFiles() when end-of-recovery
checkpoint is executed? ISTM that a recovery doesn't need to wait for
RemoveNonParentXlogFiles() to end. Instead, RemoveNonParentXlogFiles()
seems to have to complete before the checkpointer calls RemoveOldXlogFiles()
and creates .ready files for the "garbage" WAL files on the old timeline.
So it seems natual to leave that WAL recycle task to the checkpointer.
Regards,
--
Fujii Masao
On Wed, Mar 07, 2018 at 12:55:43AM +0900, Fujii Masao wrote:
So, what about, as another approach, making the checkpointer instead of
the startup process call RemoveNonParentXlogFiles() when end-of-recovery
checkpoint is executed? ISTM that a recovery doesn't need to wait for
RemoveNonParentXlogFiles() to end. Instead, RemoveNonParentXlogFiles()
seems to have to complete before the checkpointer calls RemoveOldXlogFiles()
and creates .ready files for the "garbage" WAL files on the old timeline.
So it seems natual to leave that WAL recycle task to the checkpointer.
Couldn't that impact the I/O performance at the end of recovery until
the first post-recovery checkpoint is completed? Let's not forget that
since 9.3 the end-of-recovery checkpoint is not triggered immediately,
so there could be a delay. If WAL segments of the past timeline are
recycled without waiting for this first checkpoint to happen then there
is no need to create new, zero-emptied, segments post-recovery, which
can count as well.
--
Michael
From: Michael Paquier [mailto:michael@paquier.xyz]
On Wed, Mar 07, 2018 at 12:55:43AM +0900, Fujii Masao wrote:
So, what about, as another approach, making the checkpointer instead
of the startup process call RemoveNonParentXlogFiles() when
end-of-recovery checkpoint is executed? ISTM that a recovery doesn't
need to wait for
RemoveNonParentXlogFiles() to end. Instead, RemoveNonParentXlogFiles()
seems to have to complete before the checkpointer calls
RemoveOldXlogFiles() and creates .ready files for the "garbage" WAL fileson the old timeline.
So it seems natual to leave that WAL recycle task to the checkpointer.
Couldn't that impact the I/O performance at the end of recovery until the
first post-recovery checkpoint is completed? Let's not forget that since
9.3 the end-of-recovery checkpoint is not triggered immediately, so there
could be a delay. If WAL segments of the past timeline are recycled without
waiting for this first checkpoint to happen then there is no need to create
new, zero-emptied, segments post-recovery, which can count as well.
Good point. I understood you referred to PreallocXlogFiles(), which may create one new WAL file if RemoveNonParentXlogFiles() is not called or does not recycle WAL files in the old timeline.
The best hack (or a compromise/kludge?) seems to be:
1. Modify durable_xx() functions so that they don't fsync directory hanges when enableFsync is false.
2. RemoveNonParentXlogFiles() sets enableFsync to false before the while loop, restores the original value of it after the while loop, and fsync pg_wal/ just once.
What do you think?
Regards
Takayuki Tsunakawa
On Wed, Mar 07, 2018 at 06:15:24AM +0000, Tsunakawa, Takayuki wrote:
Good point. I understood you referred to PreallocXlogFiles(), which
may create one new WAL file if RemoveNonParentXlogFiles() is not
called or does not recycle WAL files in the old timeline.The best hack (or a compromise/kludge?) seems to be:
1. Modify durable_xx() functions so that they don't fsync directory
hanges when enableFsync is false.2. RemoveNonParentXlogFiles() sets enableFsync to false before the
while loop, restores the original value of it after the while loop,
and fsync pg_wal/ just once. What do you think?
Hm. durable_xx should remain a sane operation as an isolated call as
you still get the same problem if a crash happens before flushing the
parent... Fujii-san idea also has value to speed up the end of recovery
but this costs as well in extra recycling operations post promotion. If
the checkpoint was to happen a the end of recovery then that would be
more logic, but we don't for performance reasons. Let's continue to
discuss on this thread. If you have any patch to offer, let's also look
at them.
Anyway, as things are pretty much idle on this thread for a couple of
days and that we are still discussing potential ideas, I think that this
entry should be marked as returned with feedback. Thoughts?
--
Michael
From: Michael Paquier [mailto:michael@paquier.xyz]
Hm. durable_xx should remain a sane operation as an isolated call as you
still get the same problem if a crash happens before flushing the parent...
Fujii-san idea also has value to speed up the end of recovery but this costs
as well in extra recycling operations post promotion. If the checkpoint
was to happen a the end of recovery then that would be more logic, but we
don't for performance reasons. Let's continue to discuss on this thread.
If you have any patch to offer, let's also look at them.Anyway, as things are pretty much idle on this thread for a couple of days
and that we are still discussing potential ideas, I think that this entry
should be marked as returned with feedback. Thoughts?
OK, I moved this to the next CF. Thank you for your cooperation.
Regards
Takayuki Tsunakawa
On Wed, Feb 21, 2018 at 4:52 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Feb 21, 2018 at 07:20:00AM +0000, Tsunakawa, Takayuki wrote:
Right. Then I thought of doing the following to avoid making a new
function only for RemoveNonParentXlogFiles() which is similar to
RemoveXlogFile().* Add an argument "bool durable" to RemoveXlogFile(). Based on its
value, RemoveXlogFile() calls either durable_xx() or xx().* Likewise, add an argument "bool durable" to InstallXLogFileSegment()
and do the same.* RemoveNonParentXlogFiles() calls RemoveXlogFile() with
durable=false. At the end of the function, sync the pg_wal directory
with fsync_fname().* RemoveOldXlogFiles() does the same thing. One difference is that it
passes false to RemoveXlogFile() during recovery (InRecovery == true)
and true otherwise.It seems to me that you would reintroduce partially the problems that
1d4a0ab1 has fixed. In short, if a crash happens in the code paths
calling RemoveXlogFile with durable = false before fsync'ing pg_wal,
then a rename has no guarantee to be durable, so you could finish again
with a file that as an old name, but new contents.
ISTM that this trouble will never happen because basically no one writes
new contents in the renamed WAL files while RemoveNonParentXlogFiles()
is renaming the WAL files at the end of recovery. No?
Regards,
--
Fujii Masao
On Fri, Mar 09, 2018 at 12:50:03AM +0000, Tsunakawa, Takayuki wrote:
From: Michael Paquier [mailto:michael@paquier.xyz]
Hm. durable_xx should remain a sane operation as an isolated call as you
still get the same problem if a crash happens before flushing the parent...
Fujii-san idea also has value to speed up the end of recovery but this costs
as well in extra recycling operations post promotion. If the checkpoint
was to happen a the end of recovery then that would be more logic, but we
don't for performance reasons. Let's continue to discuss on this thread.
If you have any patch to offer, let's also look at them.Anyway, as things are pretty much idle on this thread for a couple of days
and that we are still discussing potential ideas, I think that this entry
should be marked as returned with feedback. Thoughts?OK, I moved this to the next CF. Thank you for your cooperation.
The patch is still roughly with this status, so I am marking it as
returned with feedback.
--
Michael