remove more archiving overhead
Hi hackers,
With the addition of archive modules (5ef1eef) and other archiving
improvements (e.g., beb4e9b), the amount of archiving overhead has been
reduced. I spent some time measuring the remaining overhead, and the
following function stood out:
/*
* pgarch_archiveDone
*
* Emit notification that an xlog file has been successfully archived.
* We do this by renaming the status file from NNN.ready to NNN.done.
* Eventually, a checkpoint process will notice this and delete both the
* NNN.done file and the xlog file itself.
*/
static void
pgarch_archiveDone(char *xlog)
{
char rlogready[MAXPGPATH];
char rlogdone[MAXPGPATH];
StatusFilePath(rlogready, xlog, ".ready");
StatusFilePath(rlogdone, xlog, ".done");
(void) durable_rename(rlogready, rlogdone, WARNING);
}
Specifically, the call to durable_rename(), which calls fsync() a few
times, can cause a decent delay. On my machine, archiving ~12k WAL files
using a bare-bones archive module that did nothing but return true took
over a minute. When I changed this to rename() (i.e., reverting the change
to pgarch.c in 1d4a0ab), the same test took a second or less.
I also spent some time investigating whether durably renaming the archive
status files was even necessary. In theory, it shouldn't be. If a crash
happens before the rename is persisted, we might try to re-archive files,
but your archive_command/archive_library should handle that. If the file
was already recycled or removed, the archiver will skip it (thanks to
6d8727f). But when digging further, I found that WAL file recyling uses
durable_rename_excl(), which has the following note:
* Note that a crash in an unfortunate moment can leave you with two links to
* the target file.
IIUC this means that in theory, a crash at an unfortunate moment could
leave you with a .ready file, the file to archive, and another link to that
file with a "future" WAL filename. If you re-archive the file after it has
been reused, you could end up with corrupt WAL archives. I think this
might already be possible today. Syncing the directory after every rename
probably reduces the likelihood quite substantially, but if
RemoveOldXlogFiles() quickly picks up the .done file and attempts
recycling before durable_rename() calls fsync() on the directory,
presumably the same problem could occur.
I believe the current recommendation is to fail if the archive file already
exists. The basic_archive contrib module fails if the archive already
exists and has different contents, and it succeeds without re-archiving if
it already exists with identical contents. Since something along these
lines is recommended as a best practice, perhaps this is okay. But I think
we might be able to do better. If we ensure that the archive_status
directory is sync'd prior to recycling/removing a WAL file, I believe we
are safe from the aforementioned problem.
The drawback of this is that we are essentially offloading more work to the
checkpointer process. In addition to everything else it must do, it will
also need to fsync() the archive_status directory many times. I'm not sure
how big of a problem this is. It should be good to ensure that archiving
keeps up, and there are far more expensive tasks that the checkpointer must
perform that could make the added fsync() calls relatively insignificant.
I've attached a patch that makes the changes discussed above. Thoughts?
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-remove-more-archiving-overhead.patchtext/x-diff; charset=us-asciiDownload+25-4
On Mon, Feb 21, 2022 at 05:19:48PM -0800, Nathan Bossart wrote:
I also spent some time investigating whether durably renaming the archive
status files was even necessary. In theory, it shouldn't be. If a crash
happens before the rename is persisted, we might try to re-archive files,
but your archive_command/archive_library should handle that. If the file
was already recycled or removed, the archiver will skip it (thanks to
6d8727f). But when digging further, I found that WAL file recyling uses
durable_rename_excl(), which has the following note:* Note that a crash in an unfortunate moment can leave you with two links to
* the target file.IIUC this means that in theory, a crash at an unfortunate moment could
leave you with a .ready file, the file to archive, and another link to that
file with a "future" WAL filename. If you re-archive the file after it has
been reused, you could end up with corrupt WAL archives. I think this
might already be possible today. Syncing the directory after every rename
probably reduces the likelihood quite substantially, but if
RemoveOldXlogFiles() quickly picks up the .done file and attempts
recycling before durable_rename() calls fsync() on the directory,
presumably the same problem could occur.
In my testing, I found that when I killed the server just before unlink()
during WAL recyling, I ended up with links to the same file in pg_wal after
restarting. My latest test produced links to the same file for the current
WAL file and the next one. Maybe WAL recyling should use durable_rename()
instead of durable_rename_excl().
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
In my testing, I found that when I killed the server just before unlink()
during WAL recyling, I ended up with links to the same file in pg_wal after
restarting. My latest test produced links to the same file for the current
WAL file and the next one. Maybe WAL recyling should use durable_rename()
instead of durable_rename_excl().
Here is an updated patch.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-reduce-archiving-overhead.patchtext/x-diff; charset=us-asciiDownload+19-6
At Mon, 21 Feb 2022 17:19:48 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in
I also spent some time investigating whether durably renaming the archive
status files was even necessary. In theory, it shouldn't be. If a crash
happens before the rename is persisted, we might try to re-archive files,
but your archive_command/archive_library should handle that. If the file
I believe we're trying to archive a WAL segment once and only once,
even though actually we fail in certain cases.
https://www.postgresql.org/docs/14/continuous-archiving.html
| The archive command should generally be designed to refuse to
| overwrite any pre-existing archive file. This is an important safety
| feature to preserve the integrity of your archive in case of
| administrator error (such as sending the output of two different
| servers to the same archive directory).
The recommended behavior of archive_command above is to avoid this
kind of corruption. If we officially admit that a WAL segment can be
archive twice, we need to revise that part (and the following part) of
the doc.
was already recycled or removed, the archiver will skip it (thanks to
6d8727f). But when digging further, I found that WAL file recyling uses
durable_rename_excl(), which has the following note:* Note that a crash in an unfortunate moment can leave you with two links to
* the target file.IIUC this means that in theory, a crash at an unfortunate moment could
leave you with a .ready file, the file to archive, and another link to that
file with a "future" WAL filename. If you re-archive the file after it has
been reused, you could end up with corrupt WAL archives. I think this
IMHO the difference would be that the single system call (maybe)
cannot be interrupted by sigkill. So I'm not sure that that is worth
considering. The sigkill window can be elimianted if we could use
renameat(RENAME_NOREPLACE). But finally power-loss can cause that.
might already be possible today. Syncing the directory after every rename
probably reduces the likelihood quite substantially, but if
RemoveOldXlogFiles() quickly picks up the .done file and attempts
recycling before durable_rename() calls fsync() on the directory,
presumably the same problem could occur.
If we don't fsync at every rename, we don't even need for
RemoveOldXlogFiles to pickup .done very quickly. Isn't it a big
difference?
I believe the current recommendation is to fail if the archive file already
exists. The basic_archive contrib module fails if the archive already
exists and has different contents, and it succeeds without re-archiving if
it already exists with identical contents. Since something along these
lines is recommended as a best practice, perhaps this is okay. But I think
we might be able to do better. If we ensure that the archive_status
directory is sync'd prior to recycling/removing a WAL file, I believe we
are safe from the aforementioned problem.The drawback of this is that we are essentially offloading more work to the
checkpointer process. In addition to everything else it must do, it will
also need to fsync() the archive_status directory many times. I'm not sure
how big of a problem this is. It should be good to ensure that archiving
keeps up, and there are far more expensive tasks that the checkpointer must
perform that could make the added fsync() calls relatively insignificant.
I think we don't want make checkpointer slower in exchange of making
archiver faster. Perhaps we need to work using a condensed
information rather than repeatedly scanning over the distributed
information like archive_status?
At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in
On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
In my testing, I found that when I killed the server just before unlink()
during WAL recyling, I ended up with links to the same file in pg_wal after
restarting. My latest test produced links to the same file for the current
WAL file and the next one. Maybe WAL recyling should use durable_rename()
instead of durable_rename_excl().Here is an updated patch.
Is it intentional that v2 loses the xlogarchive.c part?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Import Notes
Reply to msg id not found: 20220222195229.GA42988@nathanxps1320220222011948.GA3850532@nathanxps13
Thanks for taking a look!
On Thu, Feb 24, 2022 at 02:13:49PM +0900, Kyotaro Horiguchi wrote:
https://www.postgresql.org/docs/14/continuous-archiving.html
| The archive command should generally be designed to refuse to
| overwrite any pre-existing archive file. This is an important safety
| feature to preserve the integrity of your archive in case of
| administrator error (such as sending the output of two different
| servers to the same archive directory).The recommended behavior of archive_command above is to avoid this
kind of corruption. If we officially admit that a WAL segment can be
archive twice, we need to revise that part (and the following part) of
the doc.
Yeah, we might want to recommend succeeding if the archive already exists
with identical contents (like basic_archive does). IMO this would best be
handled in a separate thread that is solely focused on documentation
updates. AFAICT this is an existing problem.
IIUC this means that in theory, a crash at an unfortunate moment could
leave you with a .ready file, the file to archive, and another link to that
file with a "future" WAL filename. If you re-archive the file after it has
been reused, you could end up with corrupt WAL archives. I think thisIMHO the difference would be that the single system call (maybe)
cannot be interrupted by sigkill. So I'm not sure that that is worth
considering. The sigkill window can be elimianted if we could use
renameat(RENAME_NOREPLACE). But finally power-loss can cause that.
Perhaps durable_rename_excl() could use renameat2() for systems that
support it. However, the problem would still exist for systems that have
link() but not renameat2(). In this particular case, there really
shouldn't be an existing file in pg_wal.
might already be possible today. Syncing the directory after every rename
probably reduces the likelihood quite substantially, but if
RemoveOldXlogFiles() quickly picks up the .done file and attempts
recycling before durable_rename() calls fsync() on the directory,
presumably the same problem could occur.If we don't fsync at every rename, we don't even need for
RemoveOldXlogFiles to pickup .done very quickly. Isn't it a big
difference?
Yeah, it probably increases the chances quite a bit.
I think we don't want make checkpointer slower in exchange of making
archiver faster. Perhaps we need to work using a condensed
information rather than repeatedly scanning over the distributed
information like archive_status?
v2 avoids giving the checkpointer more work.
At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in
On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
In my testing, I found that when I killed the server just before unlink()
during WAL recyling, I ended up with links to the same file in pg_wal after
restarting. My latest test produced links to the same file for the current
WAL file and the next one. Maybe WAL recyling should use durable_rename()
instead of durable_rename_excl().Here is an updated patch.
Is it intentional that v2 loses the xlogarchive.c part?
Yes. I found that a crash at an unfortunate moment can produce multiple
links to the same file in pg_wal, which seemed bad independent of archival.
By fixing that (i.e., switching from durable_rename_excl() to
durable_rename()), we not only avoid this problem, but we also avoid trying
to archive a file the server is concurrently writing. Then, after a crash,
the WAL file to archive should either not exist (which is handled by the
archiver) or contain the same contents as any preexisting archives.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
I tested this again with many more WAL files and a much larger machine
(r5d.24xlarge with data directory on an NVMe SSD instance store volume).
As before, I am using a bare-bones archive module that does nothing but
return true. Without the patch, archiving ~120k WAL files took about 109
seconds. With the patch, it took around 62 seconds, which amounts to a
~43% reduction in overhead.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
Yes. I found that a crash at an unfortunate moment can produce multiple
links to the same file in pg_wal, which seemed bad independent of archival.
By fixing that (i.e., switching from durable_rename_excl() to
durable_rename()), we not only avoid this problem, but we also avoid trying
to archive a file the server is concurrently writing. Then, after a crash,
the WAL file to archive should either not exist (which is handled by the
archiver) or contain the same contents as any preexisting archives.
I moved the fix for this to a new thread [0]/messages/by-id/20220407182954.GA1231544@nathanxps13 since I think it should be
back-patched. I've attached a new patch that only contains the part
related to reducing archiving overhead.
[0]: /messages/by-id/20220407182954.GA1231544@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-Reduce-overhead-of-renaming-archive-status-files.patchtext/x-diff; charset=us-asciiDownload+13-2
On Thu, Apr 7, 2022 at 6:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
Yes. I found that a crash at an unfortunate moment can produce multiple
links to the same file in pg_wal, which seemed bad independent of archival.
By fixing that (i.e., switching from durable_rename_excl() to
durable_rename()), we not only avoid this problem, but we also avoid trying
to archive a file the server is concurrently writing. Then, after a crash,
the WAL file to archive should either not exist (which is handled by the
archiver) or contain the same contents as any preexisting archives.I moved the fix for this to a new thread [0] since I think it should be
back-patched. I've attached a new patch that only contains the part
related to reducing archiving overhead.
While we're now after the feature freeze and thus this will need to
wait for v16, it looks like a reasonable change to me.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Apr 08, 2022 at 10:20:27AM -0400, Robert Haas wrote:
On Thu, Apr 7, 2022 at 6:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
Yes. I found that a crash at an unfortunate moment can produce multiple
links to the same file in pg_wal, which seemed bad independent of archival.
By fixing that (i.e., switching from durable_rename_excl() to
durable_rename()), we not only avoid this problem, but we also avoid trying
to archive a file the server is concurrently writing. Then, after a crash,
the WAL file to archive should either not exist (which is handled by the
archiver) or contain the same contents as any preexisting archives.I moved the fix for this to a new thread [0] since I think it should be
back-patched. I've attached a new patch that only contains the part
related to reducing archiving overhead.While we're now after the feature freeze and thus this will need to
wait for v16, it looks like a reasonable change to me.
Dang, just missed it. Thanks for taking a look.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 2022/04/08 7:23, Nathan Bossart wrote:
On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
Yes. I found that a crash at an unfortunate moment can produce multiple
links to the same file in pg_wal, which seemed bad independent of archival.
By fixing that (i.e., switching from durable_rename_excl() to
durable_rename()), we not only avoid this problem, but we also avoid trying
to archive a file the server is concurrently writing. Then, after a crash,
the WAL file to archive should either not exist (which is handled by the
archiver) or contain the same contents as any preexisting archives.I moved the fix for this to a new thread [0] since I think it should be
back-patched. I've attached a new patch that only contains the part
related to reducing archiving overhead.
Thanks for updating the patch. It looks good to me.
Barring any objection, I'm thinking to commit it.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Jul 7, 2022 at 10:03 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for updating the patch. It looks good to me.
Barring any objection, I'm thinking to commit it.
I don't object, but I just started to wonder whether the need to
handle re-archiving of the same file cleanly is as well-documented as
it ought to be.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 7/7/22 10:37, Robert Haas wrote:
On Thu, Jul 7, 2022 at 10:03 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for updating the patch. It looks good to me.
Barring any objection, I'm thinking to commit it.I don't object, but I just started to wonder whether the need to
handle re-archiving of the same file cleanly is as well-documented as
it ought to be.
+1, but I don't think that needs to stand in the way of this patch,
which looks sensible to me as-is. I think that's what you meant, but
just wanted to be sure.
There are plenty of ways that already-archived WAL might get archived
again and this is just one of them.
Thoughts, Nathan?
Regards,
-David
On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
On 7/7/22 10:37, Robert Haas wrote:
On Thu, Jul 7, 2022 at 10:03 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for updating the patch. It looks good to me.
Barring any objection, I'm thinking to commit it.I don't object, but I just started to wonder whether the need to
handle re-archiving of the same file cleanly is as well-documented as
it ought to be.+1, but I don't think that needs to stand in the way of this patch, which
looks sensible to me as-is. I think that's what you meant, but just wanted
to be sure.
Yeah, this seems like something that should be documented. I can pick this
up. I believe this is an existing problem, but this patch could make it
more likely.
There are plenty of ways that already-archived WAL might get archived again
and this is just one of them.
What are some of the others? I was aware of the case that was fixed in
ff9f111, where we might try to re-archive a file with different contents,
but I'm curious what other ways you've seen this happen.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Jul 07, 2022 at 09:18:25AM -0700, Nathan Bossart wrote:
On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
On 7/7/22 10:37, Robert Haas wrote:
I don't object, but I just started to wonder whether the need to
handle re-archiving of the same file cleanly is as well-documented as
it ought to be.+1, but I don't think that needs to stand in the way of this patch, which
looks sensible to me as-is. I think that's what you meant, but just wanted
to be sure.Yeah, this seems like something that should be documented. I can pick this
up. I believe this is an existing problem, but this patch could make it
more likely.
Here is a first try at documenting this. I'm not thrilled about the
placement, since it feels a bit buried in the backup docs, but this is
where this sort of thing lives today. It also seems odd to stress the
importance of avoiding overwriting pre-existing archives in case multiple
servers are archiving to the same place while only offering solutions with
obvious race conditions. Even basic_archive is subject to this now that
durable_rename_excl() no longer exists. Perhaps we should make a note of
that, too.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-add-note-about-re-archiving-in-docs.patchtext/x-diff; charset=us-asciiDownload+18-5
On 7/7/22 12:18, Nathan Bossart wrote:
On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
There are plenty of ways that already-archived WAL might get archived again
and this is just one of them.What are some of the others? I was aware of the case that was fixed in
ff9f111, where we might try to re-archive a file with different contents,
but I'm curious what other ways you've seen this happen.
On the PG side, crashes and (IIRC) immediate shutdown.
In general, any failure in the archiver itself. Storage, memory,
network, etc. There are plenty of ways that the file might make it to
storage but postgres never gets notified, so it will retry.
Any archiver that is not tolerant of this fact is not going to be very
useful and this patch only makes it slightly more true.
Regards,
-David
On Thu, Jul 07, 2022 at 02:07:26PM -0400, David Steele wrote:
On 7/7/22 12:18, Nathan Bossart wrote:
On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
There are plenty of ways that already-archived WAL might get archived again
and this is just one of them.What are some of the others? I was aware of the case that was fixed in
ff9f111, where we might try to re-archive a file with different contents,
but I'm curious what other ways you've seen this happen.On the PG side, crashes and (IIRC) immediate shutdown.
In general, any failure in the archiver itself. Storage, memory, network,
etc. There are plenty of ways that the file might make it to storage but
postgres never gets notified, so it will retry.Any archiver that is not tolerant of this fact is not going to be very
useful and this patch only makes it slightly more true.
Ah, got it, makes sense.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Jul 07, 2022 at 10:51:42AM -0700, Nathan Bossart wrote:
+ library to ensure that it indeed does not overwrite an existing file. When + a pre-existing file is encountered, the archive library may return + <literal>true</literal> if the WAL file has identical contents to the + pre-existing archive. Alternatively, the archive library can return
This should likely say something about ensuring the pre-existing file is
persisted to disk before returning true.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 7/7/22 14:22, Nathan Bossart wrote:
On Thu, Jul 07, 2022 at 10:51:42AM -0700, Nathan Bossart wrote:
+ library to ensure that it indeed does not overwrite an existing file. When + a pre-existing file is encountered, the archive library may return + <literal>true</literal> if the WAL file has identical contents to the + pre-existing archive. Alternatively, the archive library can returnThis should likely say something about ensuring the pre-existing file is
persisted to disk before returning true.
Agreed.
Also, I'd prefer to remove "indeed" from this sentence (yes, I see that
was in the original text):
+ library to ensure that it indeed does not overwrite an existing
Regards,
-David
On Thu, Jul 07, 2022 at 05:06:13PM -0400, David Steele wrote:
On 7/7/22 14:22, Nathan Bossart wrote:
On Thu, Jul 07, 2022 at 10:51:42AM -0700, Nathan Bossart wrote:
+ library to ensure that it indeed does not overwrite an existing file. When + a pre-existing file is encountered, the archive library may return + <literal>true</literal> if the WAL file has identical contents to the + pre-existing archive. Alternatively, the archive library can returnThis should likely say something about ensuring the pre-existing file is
persisted to disk before returning true.Agreed.
Also, I'd prefer to remove "indeed" from this sentence (yes, I see that was
in the original text):+ library to ensure that it indeed does not overwrite an existing
Here's an updated patch.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-add-note-about-re-archiving-in-docs.patchtext/x-diff; charset=us-asciiDownload+19-6
At Thu, 7 Jul 2022 15:07:16 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
Here's an updated patch.
Thinking RFC'ish, the meaning of "may" and "must" is significant in
this description. On the other hand it uses both "may" and "can" but
I thinkthat their difference is not significant or "can" there is
somewhat confusing. I think the "can" should be "may" here.
And since "must" is emphasized, doesn't "may" also needs emphasis?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center