Use durable_unlink for .ready and .done files for WAL segment removal
Hi all,
While reviewing the archiving code, I have bumped into the fact that
XLogArchiveCleanup() thinks that it is safe to do only a plain unlink()
for .ready and .done files when removing a past segment. I don't think
that it is a smart move, as on a subsequent crash we may still see
those, but the related segment would have gone away. This is not really
a problem for .done files, but it could confuse the archiver to see some
.ready files about things that have already gone away.
Attached is a patch. Thoughts?
--
Michael
Attachments:
archive-clean-durable.patch (text/x-diff, +2/-2)
Hi,
On 2018-09-28 12:28:27 +0900, Michael Paquier wrote:
While reviewing the archiving code, I have bumped into the fact that
XLogArchiveCleanup() thinks that it is safe to do only a plain unlink()
for .ready and .done files when removing a past segment. I don't think
that it is a smart move, as on a subsequent crash we may still see
those, but the related segment would have gone away. This is not really
a problem for .done files, but it could confuse the archiver to see some
.ready files about things that have already gone away.
Isn't that window fundamentally there anyway?
- Andres
On Thu, Sep 27, 2018 at 08:40:26PM -0700, Andres Freund wrote:
On 2018-09-28 12:28:27 +0900, Michael Paquier wrote:
While reviewing the archiving code, I have bumped into the fact that
XLogArchiveCleanup() thinks that it is safe to do only a plain unlink()
for .ready and .done files when removing a past segment. I don't think
that it is a smart move, as on a subsequent crash we may still see
those, but the related segment would have gone away. This is not really
a problem for .done files, but it could confuse the archiver to see some
.ready files about things that have already gone away.
Isn't that window fundamentally there anyway?
Sure. However the point I would like to make is that if we have the
possibility to reduce this window, then we should.
--
Michael
On September 27, 2018 10:23:31 PM PDT, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Sep 27, 2018 at 08:40:26PM -0700, Andres Freund wrote:
On 2018-09-28 12:28:27 +0900, Michael Paquier wrote:
While reviewing the archiving code, I have bumped into the fact that
XLogArchiveCleanup() thinks that it is safe to do only a plain unlink()
for .ready and .done files when removing a past segment. I don't think
that it is a smart move, as on a subsequent crash we may still see
those, but the related segment would have gone away. This is not really
a problem for .done files, but it could confuse the archiver to see some
.ready files about things that have already gone away.
Isn't that window fundamentally there anyway?
Sure. However the point I would like to make is that if we have the
possibility to reduce this window, then we should.
It's not free though. I don't think this is as clear cut as you make it sound.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Greetings,
* Michael Paquier (michael@paquier.xyz) wrote:
While reviewing the archiving code, I have bumped into the fact that
XLogArchiveCleanup() thinks that it is safe to do only a plain unlink()
for .ready and .done files when removing a past segment. I don't think
that it is a smart move, as on a subsequent crash we may still see
those, but the related segment would have gone away. This is not really
a problem for .done files, but it could confuse the archiver to see some
.ready files about things that have already gone away.
Is there an issue with making the archiver able to understand that
situation instead of being confused by it..? Seems like that'd probably
be a good thing to do regardless of this, but that would then remove the
need for this kind of change..
Thanks!
Stephen
On Fri, Sep 28, 2018 at 02:36:19PM -0400, Stephen Frost wrote:
Is there an issue with making the archiver able to understand that
situation instead of being confused by it..? Seems like that'd probably
be a good thing to do regardless of this, but that would then remove the
need for this kind of change..
I thought about that a bit, and there is also a lot which can be done
within the archive_command itself regarding that, so I am not sure that
there is an argument to make pgarch.c more complicated than it should be.
Now it is true that for most users having a .ready file but no segment
would most likely lead to a failure. I suspect that a large user base
is still just using plain cp in archive_command, which would cause the
archiver to be stuck. So we could actually just tweak pgarch_readyXlog
to check if the segment fetched actually exists (see the bottom of said
function). If it doesn't, then the archiver removes the .ready file and
retries fetching a new segment.
--
Michael
Greetings,
* Michael Paquier (michael@paquier.xyz) wrote:
On Fri, Sep 28, 2018 at 02:36:19PM -0400, Stephen Frost wrote:
Is there an issue with making the archiver able to understand that
situation instead of being confused by it..? Seems like that'd probably
be a good thing to do regardless of this, but that would then remove the
need for this kind of change..
I thought about that a bit, and there is also a lot which can be done
within the archive_command itself regarding that, so I am not sure that
there is an argument to make pgarch.c more complicated than it should be.
Now it is true that for most users having a .ready file but no segment
would most likely lead to a failure. I suspect that a large user base
is still just using plain cp in archive_command, which would cause the
archiver to be stuck. So we could actually just tweak pgarch_readyXlog
to check if the segment fetched actually exists (see the bottom of said
function). If it doesn't, then the archiver removes the .ready file and
retries fetching a new segment.
Yes, checking if the WAL file exists before calling archive_command on
it is what I was thinking we'd do here, and if it doesn't, then just
remove the .ready file.
An alternative would be to go through the .ready files on crash-recovery
and remove any .ready files that don't have corresponding WAL files, or
if we felt that it was necessary, we could do that on every restart but
do we really think we'd need to do that..?
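The crash-recovery scan suggested above could look roughly like the following. This is a standalone POSIX illustration with hypothetical names, simplified error handling, and fixed-size buffers; the backend would instead use ReadDir(), durable_unlink(), and its own path helpers:

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sketch: walk the archive status directory and remove any
 * "<segment>.ready" file whose corresponding WAL segment no longer
 * exists.  Hypothetical standalone version of the startup cleanup
 * being discussed, in the spirit of RemoveTempXlogFiles().
 */
static void
cleanup_orphan_ready_files(const char *wal_dir, const char *status_dir)
{
    DIR        *dir = opendir(status_dir);
    struct dirent *de;

    if (dir == NULL)
        return;

    while ((de = readdir(dir)) != NULL)
    {
        size_t  len = strlen(de->d_name);
        char    segpath[1024];
        char    readypath[1024];
        struct stat st;

        /* only look at "<segment>.ready" entries */
        if (len <= 6 || strcmp(de->d_name + len - 6, ".ready") != 0)
            continue;

        /* build the path of the segment the status file points to */
        snprintf(segpath, sizeof(segpath), "%s/%.*s",
                 wal_dir, (int) (len - 6), de->d_name);
        if (stat(segpath, &st) == 0)
            continue;           /* segment still exists, keep .ready */

        snprintf(readypath, sizeof(readypath), "%s/%s",
                 status_dir, de->d_name);
        if (unlink(readypath) == 0)
            fprintf(stderr, "removed orphan status file \"%s\"\n",
                    readypath);
    }
    closedir(dir);
}
```

The cost is one stat() per .ready file at startup, which is the added delay Nathan raises downthread.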
Thanks!
Stephen
On Fri, Sep 28, 2018 at 07:16:25PM -0400, Stephen Frost wrote:
An alternative would be to go through the .ready files on crash-recovery
and remove any .ready files that don't have corresponding WAL files, or
if we felt that it was necessary, we could do that on every restart but
do we really think we'd need to do that..?
Actually, what you are proposing here sounds much better to me. That's
in the area of what has been done recently with RemoveTempXlogFiles() in
5fc1008e. Any objections to doing something like that?
--
Michael
On Sat, Sep 29, 2018 at 04:58:57PM +0900, Michael Paquier wrote:
Actually, what you are proposing here sounds much better to me. That's
in the area of what has been done recently with RemoveTempXlogFiles() in
5fc1008e. Any objections to doing something like that?
Okay. I have hacked up a patch based on Stephen's idea, as attached. Any
opinions?
--
Michael
Attachments:
archive-missing-v1.patch (text/x-diff, +74/-0)
One argument for instead checking WAL file existence before calling
archive_command might be to avoid the increased startup time.
Granted, any added delay from this patch is unlikely to be noticeable
unless your archiver is way behind and archive_status has a huge
number of files. However, I have seen cases where startup is stuck on
other tasks like SyncDataDirectory() and RemovePgTempFiles() for a
very long time, so perhaps it is worth considering.
Nathan
At Fri, 02 Nov 2018 14:47:08 +0000, Nathan Bossart <bossartn@amazon.com> wrote in <154117002849.5569.14588306221618961668.pgcf@coridan.postgresql.org>
One argument for instead checking WAL file existence before calling
archive_command might be to avoid the increased startup time.
Granted, any added delay from this patch is unlikely to be noticeable
unless your archiver is way behind and archive_status has a huge
number of files. However, I have seen cases where startup is stuck on
other tasks like SyncDataDirectory() and RemovePgTempFiles() for a
very long time, so perhaps it is worth considering.
While archive_mode is turned on, .ready files are created for all
existing WAL files that do not have one yet. Thus the archiver can wait
for the earliest segment to get its .ready file. As a result,
pgarch_readyXLog could be modified to loop over WAL files, not
status files. This prevents the confusion arising from .ready
files for non-existent segment files.
RemoveXlogFile as is doesn't get confused by .done files for
nonexistent segments.
We may leave useless .done/.ready files behind, but since we would no
longer scan over them, it would not matter how many files are in the
directory. The remaining issue is the removal of those files. Even if we
blew the directory away altogether, the status files would be cleanly
recreated, though already-archived WAL segments would then be archived
again. However, such a redundant copy does no harm with the recommended
configuration :p
# Indeed, I see that almost all sites use a simple 'cp' or 'scp' for that..
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Nov 15, 2018 at 07:39:27PM +0900, Kyotaro HORIGUCHI wrote:
At Fri, 02 Nov 2018 14:47:08 +0000, Nathan Bossart
<bossartn@amazon.com> wrote in
<154117002849.5569.14588306221618961668.pgcf@coridan.postgresql.org>:
One argument for instead checking WAL file existence before calling
archive_command might be to avoid the increased startup time.
I guess that you mean the startup of the archive command itself here.
Yes that can be an issue with a high WAL output depending on the
interpreter of the archive command :(
Granted, any added delay from this patch is unlikely to be noticeable
unless your archiver is way behind and archive_status has a huge
number of files. However, I have seen cases where startup is stuck on
other tasks like SyncDataDirectory() and RemovePgTempFiles() for a
very long time, so perhaps it is worth considering.
What's the scale of the pg_wal partition and the amount of time things
were stuck? I would imagine that the sync phase hurts the most, and a
fast startup time for crash recovery is always important.
While archive_mode is tuned on, .ready files are created for all
existing wal files if not exists. Thus archiver may wait for the
earliest segment to have .ready file.
Yes, RemoveOldXlogFiles() does that via XLogArchiveCheckDone().
As a result,
pgarch_readyXLog could be modified to loop over WAL files, not
status files. This prevents the confusion arising from .ready
files for non-existent segment files.
No, pgarch_readyXLog() should still look after .ready files as those are
here for this purpose, but we could have an additional check to see if
the segment linked with it actually exists and can be archived. This
check could happen in the pgarch.c code before the archive command
gets called (just before pgarch_ArchiverCopyLoop and after
XLogArchiveCommandSet feels about right), and it should be cheap
enough to call stat().
--
Michael
On Thu, Nov 22, 2018 at 01:16:09PM +0900, Michael Paquier wrote:
No, pgarch_readyXLog() should still look after .ready files as those are
here for this purpose, but we could have an additional check to see if
the segment linked with it actually exists and can be archived. This
check could happen in the pgarch.c code before the archive command
gets called (just before pgarch_ArchiverCopyLoop and after
XLogArchiveCommandSet feels about right), and it should be cheap
enough to call stat().
s/pgarch_ArchiverCopyLoop/pgarch_archiveXlog/.
Attached is a patch shaped based on the idea from upthread.
Thoughts?
--
Michael
Attachments:
archive-missing-v2.patch (text/x-diff, +23/-0)
On 11/21/18, 10:16 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
At Fri, 02 Nov 2018 14:47:08 +0000, Nathan Bossart
<bossartn@amazon.com> wrote in
<154117002849.5569.14588306221618961668.pgcf@coridan.postgresql.org>:
Granted, any added delay from this patch is unlikely to be noticeable
unless your archiver is way behind and archive_status has a huge
number of files. However, I have seen cases where startup is stuck on
other tasks like SyncDataDirectory() and RemovePgTempFiles() for a
very long time, so perhaps it is worth considering.
What's the scale of the pg_wal partition and the amount of time things
were stuck? I would imagine that the sync phase hurts the most, and a
fast startup time for crash recovery is always important.
I don't have exact figures to share, but yes, a huge number of calls
to sync_file_range() and fsync() can use up a lot of time. Presumably
Postgres processes files individually instead of using sync() because
sync() may return before writing is done. Also, sync() would affect
non-Postgres files. However, it looks like Linux actually does wait
for writing to complete before returning from sync() [0].
[0] http://man7.org/linux/man-pages/man2/sync.2.html
For RemovePgTempFiles(), the documentation above the function
indicates that skipping temp file cleanup during startup would
actually be okay because collisions with existing temp file names
should be handled by OpenTemporaryFile(). I assume this cleanup is
done during startup because there isn't a great alternative besides
offloading the work to a new background worker or something.
On 11/27/18, 6:35 AM, "Michael Paquier" <michael@paquier.xyz> wrote:
Attached is a patch showing shaped based on the idea of upthread.
Thoughts?
I took a look at this patch.
+ /*
+ * In the event of a system crash, archive status files may be
+ * left behind as their removals are not durable, so cleaning up
+ * orphan entries here is the cheapest method. Hence, check that
+ * the segment to be archived still exists.
+ */
+ snprintf(pathname, MAXPGPATH, XLOGDIR "/%s", xlog);
+ if (stat(pathname, &stat_buf) != 0)
+ {
Don't we also need to check that errno is ENOENT here?
+ StatusFilePath(xlogready, xlog, ".ready");
+ if (durable_unlink(xlogready, WARNING) == 0)
+ ereport(WARNING,
+ (errmsg("removed orphan archive status file %s",
+ xlogready)));
+ return;
IIUC any time that the file does not exist, we will attempt to unlink
it. Regardless of whether unlinking fails or succeeds, we then
proceed to give up archiving for now, but it's not clear why. Perhaps
we should retry unlinking a number of times (like we do for
pgarch_archiveXlog()) when durable_unlink() fails and simply "break"
to move on to the next .ready file if durable_unlink() succeeds.
Nathan
Hi,
On 2018-11-27 20:43:06 +0000, Bossart, Nathan wrote:
I don't have exact figures to share, but yes, a huge number of calls
to sync_file_range() and fsync() can use up a lot of time. Presumably
Postgres processes files individually instead of using sync() because
sync() may return before writing is done. Also, sync() would affect
non-Postgres files. However, it looks like Linux actually does wait
for writing to complete before returning from sync() [0].
sync() has absolutely no way to report errors. So, we're never going to
be able to use it. Besides, even postgres' temp files would be a good
reason to not use it.
Greetings,
Andres Freund
On 11/27/18, 2:46 PM, "Andres Freund" <andres@anarazel.de> wrote:
On 2018-11-27 20:43:06 +0000, Bossart, Nathan wrote:
I don't have exact figures to share, but yes, a huge number of calls
to sync_file_range() and fsync() can use up a lot of time. Presumably
Postgres processes files individually instead of using sync() because
sync() may return before writing is done. Also, sync() would affect
non-Postgres files. However, it looks like Linux actually does wait
for writing to complete before returning from sync() [0].
sync() has absolutely no way to report errors. So, we're never going to
be able to use it. Besides, even postgres' temp files would be a good
reason to not use it.
Ah, I see. Thanks for clarifying.
Nathan
On Tue, Nov 27, 2018 at 08:43:06PM +0000, Bossart, Nathan wrote:
Don't we also need to check that errno is ENOENT here?
Yep.
IIUC any time that the file does not exist, we will attempt to unlink
it. Regardless of whether unlinking fails or succeeds, we then
proceed to give up archiving for now, but it's not clear why. Perhaps
we should retry unlinking a number of times (like we do for
pgarch_archiveXlog()) when durable_unlink() fails and simply "break"
to move on to the next .ready file if durable_unlink() succeeds.
Both suggestions sound reasonable to me (durable_unlink is not called
on HEAD in pgarch_archiveXlog). How about 3 retries with an in-between
wait of 1s? That's consistent with what pgarch_ArchiverCopyLoop does,
though I am not completely sure if we actually want to be consistent for
the purpose of removing orphaned ready files.
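A retry policy along these lines could be sketched as follows; the constant, the function name, and the plain unlink() are illustrative, as the actual patch would use durable_unlink() inside pgarch.c:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_ORPHAN_RETRIES 3    /* illustrative, mirrors the archive retry count */

/*
 * Sketch: try to remove an orphaned .ready file a few times, sleeping
 * one second between attempts, and report whether the removal stuck.
 * On success the caller would "break" and move on to the next .ready
 * file; on failure it gives up and lets the archiver retry later.
 */
static bool
remove_orphan_ready(const char *readypath)
{
    for (int attempt = 0; attempt < NUM_ORPHAN_RETRIES; attempt++)
    {
        if (unlink(readypath) == 0)
        {
            fprintf(stderr, "removed orphan archive status file \"%s\"\n",
                    readypath);
            return true;
        }
        if (attempt + 1 < NUM_ORPHAN_RETRIES)
            sleep(1);           /* wait before retrying, as for archiving */
    }
    return false;
}
```

Keeping this counter separate from the archiving retry counter matches the point below that archiving failures and orphan-removal failures are two different concepts.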
--
Michael
On 11/27/18, 3:20 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
On Tue, Nov 27, 2018 at 08:43:06PM +0000, Bossart, Nathan wrote:
IIUC any time that the file does not exist, we will attempt to unlink
it. Regardless of whether unlinking fails or succeeds, we then
proceed to give up archiving for now, but it's not clear why. Perhaps
we should retry unlinking a number of times (like we do for
pgarch_archiveXlog()) when durable_unlink() fails and simply "break"
to move on to the next .ready file if durable_unlink() succeeds.
Both suggestions sound reasonable to me (durable_unlink is not called
on HEAD in pgarch_archiveXlog). How about 3 retries with an in-between
wait of 1s? That's consistent with what pgarch_ArchiverCopyLoop does,
though I am not completely sure if we actually want to be consistent for
the purpose of removing orphaned ready files.
That sounds good to me. I was actually thinking of using the same
retry counter that we use for pgarch_archiveXlog(), but on second
thought, it is probably better to have two independent retry counters
for these two unrelated operations.
Nathan
On Tue, Nov 27, 2018 at 09:49:29PM +0000, Bossart, Nathan wrote:
That sounds good to me. I was actually thinking of using the same
retry counter that we use for pgarch_archiveXlog(), but on second
thought, it is probably better to have two independent retry counters
for these two unrelated operations.
What I had in mind was two different variables if what I wrote was
unclear, possibly with the same value, as archiving failure and failure
with orphan file removals are two different concepts.
--
Michael
On 11/27/18, 3:53 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
On Tue, Nov 27, 2018 at 09:49:29PM +0000, Bossart, Nathan wrote:
That sounds good to me. I was actually thinking of using the same
retry counter that we use for pgarch_archiveXlog(), but on second
thought, it is probably better to have two independent retry counters
for these two unrelated operations.
What I had in mind was two different variables if what I wrote was
unclear, possibly with the same value, as archiving failure and failure
with orphan file removals are two different concepts.
+1
Nathan