avoid multiple hard links to same WAL file after a crash
Hi hackers,
I am splitting this off of a previous thread aimed at reducing archiving
overhead [0]/messages/by-id/20220222011948.GA3850532@nathanxps13, as I believe this fix might deserve back-patching.
Presently, WAL recycling uses durable_rename_excl(), which notes that a
crash at an unfortunate moment can result in two links to the same file.
My testing [1]/messages/by-id/20220222173711.GA3852671@nathanxps13 demonstrated that it was possible to end up with two links
to the same file in pg_wal after a crash just before unlink() during WAL
recycling. Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL file was
re-recycled upon restarting. This seems likely to lead to WAL corruption.
The attached patch prevents this problem by using durable_rename() instead
of durable_rename_excl() for WAL recycling. This removes the protection
against accidentally overwriting an existing WAL file, but there shouldn't
be one.
This patch also sets the stage for reducing archiving overhead (as
discussed in the other thread [0]/messages/by-id/20220222011948.GA3850532@nathanxps13). The proposed change to reduce
archiving overhead will make it more likely that the server will attempt to
re-archive segments after a crash. This might lead to archive corruption
if the server concurrently writes to the same file via the aforementioned
bug.
[0]: /messages/by-id/20220222011948.GA3850532@nathanxps13
[1]: /messages/by-id/20220222173711.GA3852671@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Avoid-multiple-hard-links-to-same-WAL-file-after-.patchtext/x-diff; charset=us-asciiDownload+6-5
On Thu, Apr 7, 2022 at 2:30 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
Presently, WAL recycling uses durable_rename_excl(), which notes that a
crash at an unfortunate moment can result in two links to the same file.
My testing [1] demonstrated that it was possible to end up with two links
to the same file in pg_wal after a crash just before unlink() during WAL
recycling. Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL file was
re-recycled upon restarting. This seems likely to lead to WAL corruption.
Wow, that's bad.
The attached patch prevents this problem by using durable_rename() instead
of durable_rename_excl() for WAL recycling. This removes the protection
against accidentally overwriting an existing WAL file, but there shouldn't
be one.
I see that durable_rename_excl() has the following comment: "Similar
to durable_rename(), except that this routine tries (but does not
guarantee) not to overwrite the target file." If those are the desired
semantics, we could achieve them more simply and more safely by just
trying to stat() the target file and then, if it's not found, call
durable_rename(). I think that would be a heck of a lot safer than
what this function is doing right now.
I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd changing to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.
Your proposed fix is OK if we don't want to do any of that stuff, but
personally I'm much more inclined to blame durable_rename_excl() for
being horrible than I am to blame the calling code for using it
improvidently.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
I see that durable_rename_excl() has the following comment: "Similar
to durable_rename(), except that this routine tries (but does not
guarantee) not to overwrite the target file." If those are the desired
semantics, we could achieve them more simply and more safely by just
trying to stat() the target file and then, if it's not found, call
durable_rename(). I think that would be a heck of a lot safer than
what this function is doing right now.
IIUC it actually does guarantee that you won't overwrite the target file
when HAVE_WORKING_LINK is defined. If not, it provides no guarantees at
all. Using stat() before rename() would therefore weaken this check for
systems with working link(), but it'd probably strengthen it for systems
without a working link().
I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd changing to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.
I think there might be another problem. The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file. This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths. My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original. Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.
Your proposed fix is OK if we don't want to do any of that stuff, but
personally I'm much more inclined to blame durable_rename_excl() for
being horrible than I am to blame the calling code for using it
improvidently.
I do agree that it's worth examining this stuff a bit closer. I've
frequently found myself trying to reason about all the different states
that callers of these functions can produce, so any changes that help
simplify matters are a win in my book.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Apr 08, 2022 at 09:53:12AM -0700, Nathan Bossart wrote:
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd changing to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.I think there might be another problem. The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file. This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths. My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original. Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.
Ha, so there are only a few callers of durable_rename_excl() in the
PostgreSQL tree. One is basic_archive.c, which is already doing a stat()
check. IIRC I only used durable_rename_excl() here to handle the case
where multiple servers are writing archives to the same location. If that
happened, the archiver process would begin failing. If a crash left two
hard links to the same file around, we will silently succeed the next time
around thanks to the compare_files() check. Besides the WAL installation
code, the only other callers are in timeline.c, and both note that the use
of durable_rename_excl() is for "paranoidly trying to avoid overwriting an
existing file (there shouldn't be one)."
So AFAICT basic_archive.c is the only caller with a strong reason for using
durable_rename_excl(), and even that might not be worth keeping it around.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd changing to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.
I've attached a patch that simply removes durable_rename_excl() and
replaces existing calls with durable_rename(). I noticed that Andres
expressed similar misgivings about durable_rename_excl() last year [0]https://postgr.es/me/20210318014812.ds2iz4jz5h7la6un%40alap3.anarazel.de [1]/messages/by-id/20210318023004.gz2aejhze2kkkqr2@alap3.anarazel.de.
I can create a stat-then-durable-rename version of this for back-patching
if that is still the route we want to go.
[0]: https://postgr.es/me/20210318014812.ds2iz4jz5h7la6un%40alap3.anarazel.de
[1]: /messages/by-id/20210318023004.gz2aejhze2kkkqr2@alap3.anarazel.de
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Remove-durable_rename_excl.patchtext/x-diff; charset=us-asciiDownload+7-92
On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
I see that durable_rename_excl() has the following comment: "Similar
to durable_rename(), except that this routine tries (but does not
guarantee) not to overwrite the target file." If those are the desired
semantics, we could achieve them more simply and more safely by just
trying to stat() the target file and then, if it's not found, call
durable_rename(). I think that would be a heck of a lot safer than
what this function is doing right now.IIUC it actually does guarantee that you won't overwrite the target file
when HAVE_WORKING_LINK is defined. If not, it provides no guarantees at
all. Using stat() before rename() would therefore weaken this check for
systems with working link(), but it'd probably strengthen it for systems
without a working link().
Sure, but a guarantee that happens on only some systems isn't worth
much. And, if it comes at the price of potentially having multiple
hard links to the same file in obscure situations, that seems like it
could easily cause more problems than this whole scheme can ever hope
to solve.
I think there might be another problem. The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file. This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths. My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original. Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.
I wonder if this is really true. I thought rename() was supposed to be atomic.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote:
I think there might be another problem. The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file. This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths. My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original. Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.I wonder if this is really true. I thought rename() was supposed to be atomic.
Looks like it's atomic in that it's not like cp + rm, but not atomic the other
way you want.
| If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which
| both oldpath and newpath refer to the file being renamed.
At Thu, 7 Apr 2022 11:29:54 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
The attached patch prevents this problem by using durable_rename() instead
of durable_rename_excl() for WAL recycling. This removes the protection
against accidentally overwriting an existing WAL file, but there shouldn't
be one.
From another direction, if the new segment was the currently active
one, we just mustn't install it. Otherwise we don't care.
So, the only thing we need to care is segment switch. Without it, the
segment that InstallXLogFileSegment found by the stat loop is known to
be safe to overwrite even if exists.
When segment switch finds an existing file, it's no problem since the
segment switch doesn't create a new segment. Otherwise segment switch
always calls InstallXLogFileSegment. The section from searching for
an empty segmetn slot until calling durable_rename_excl() is protected
by ControlFileLock. Thus if a process is in the section, no other
process can switch to a newly-created segment.
If this diagnosis is correct, the comment is proved to be paranoid.
* Perform the rename using link if available, paranoidly trying to avoid
* overwriting an existing file (there shouldn't be one).
As the result, I think Nathan's fix is correct that we can safely use
durable_rename() instead.
And I propose to use renameat2 on Linux so that we can detect the
contradicting case by the regression tests even though only on Linux.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
So, the only thing we need to care is segment switch. Without it, the
segment that InstallXLogFileSegment found by the stat loop is known to
be safe to overwrite even if exists.When segment switch finds an existing file, it's no problem since the
segment switch doesn't create a new segment. Otherwise segment switch
always calls InstallXLogFileSegment. The section from searching for
an empty segmetn slot until calling durable_rename_excl() is protected
by ControlFileLock. Thus if a process is in the section, no other
process can switch to a newly-created segment.If this diagnosis is correct, the comment is proved to be paranoid.
It's sometimes difficult to understand what problems really old code
comments are worrying about. For example, could they have been
worrying about bugs in the code? Could they have been worrying about
manual interference with the pg_wal directory? It's hard to know.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:If this diagnosis is correct, the comment is proved to be paranoid.
It's sometimes difficult to understand what problems really old code
comments are worrying about. For example, could they have been
worrying about bugs in the code? Could they have been worrying about
manual interference with the pg_wal directory? It's hard to know.
"git blame" can be helpful here, if you trace back to when the comment
was written and then try to find the associated mailing-list discussion.
(That leap can be difficult for commits pre-dating our current
convention of including links in the commit message, but it's usually
not *that* hard to locate contemporaneous discussion.)
regards, tom lane
On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:If this diagnosis is correct, the comment is proved to be paranoid.
It's sometimes difficult to understand what problems really old code
comments are worrying about. For example, could they have been
worrying about bugs in the code? Could they have been worrying about
manual interference with the pg_wal directory? It's hard to know."git blame" can be helpful here, if you trace back to when the comment
was written and then try to find the associated mailing-list discussion.
(That leap can be difficult for commits pre-dating our current
convention of including links in the commit message, but it's usually
not *that* hard to locate contemporaneous discussion.)
I traced this back a while ago. I believe the link() was first added in
November 2000 as part of f0e37a8. This even predates WAL recycling, which
was added in July 2001 as part of 7d4d5c0.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:If this diagnosis is correct, the comment is proved to be paranoid.
It's sometimes difficult to understand what problems really old code
comments are worrying about. For example, could they have been
worrying about bugs in the code? Could they have been worrying about
manual interference with the pg_wal directory? It's hard to know."git blame" can be helpful here, if you trace back to when the comment
was written and then try to find the associated mailing-list discussion.
(That leap can be difficult for commits pre-dating our current
convention of including links in the commit message, but it's usually
not *that* hard to locate contemporaneous discussion.)I traced this back a while ago. I believe the link() was first added in
November 2000 as part of f0e37a8. This even predates WAL recycling, which
was added in July 2001 as part of 7d4d5c0.
f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from
somwhere out of the ML.. This patch changed XLogFileInit to
supportusing existent files so that XLogWrite can use the new segment
provided by checkpoint and still allow XLogWrite to create a new
segment by itself.
Just before the commit, calls to XLogFileInit were protected (or
serialized) by logwr_lck. At the commit calls to the same function
were still serialized by ControlFileLockId.
I *guess* that Vadim faced/noticed a race condition when he added
checkpoint. Thus introduced the link+remove protocol but finally it
became useless by moving the call to XLogFileInit within
ControlFileLockId section. But, of course, all of story is mere a
guess.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote:
At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
I traced this back a while ago. I believe the link() was first added in
November 2000 as part of f0e37a8. This even predates WAL recycling, which
was added in July 2001 as part of 7d4d5c0.f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from
somwhere out of the ML.. This patch changed XLogFileInit to
supportusing existent files so that XLogWrite can use the new segment
provided by checkpoint and still allow XLogWrite to create a new
segment by itself.
Yeah, I've been unable to find any discussion besides a brief reference to
adding checkpointing [0]/messages/by-id/8F4C99C66D04D4118F580090272A7A23018D85@sectorbase1.sectorbase.com.
[0]: /messages/by-id/8F4C99C66D04D4118F580090272A7A23018D85@sectorbase1.sectorbase.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote:
On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
I think there might be another problem. The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file. This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths. My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original. Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.I wonder if this is really true. I thought rename() was supposed to be atomic.
Not always. For example, some old versions of MacOS have a non-atomic
implementation of rename(), like prairiedog with 10.4. Even 10.5 does
not handle atomicity as far as I call. In short, it looks like a bad
idea to me to rely on this idea at all. Some FSes have their own way
of handling things, as well, but I am not much into this world.
Saying that, it would be nice to see durable_rename_excl() gone as it
has created quite a bit of pain for us in the past years.
--
Michael
On Mon, Apr 18, 2022 at 04:48:35PM +0900, Michael Paquier wrote:
Saying that, it would be nice to see durable_rename_excl() gone as it
has created quite a bit of pain for us in the past years.
Yeah, I think this is the right thing to do. Patch upthread [0]/messages/by-id/20220408194345.GA1541826@nathanxps13.
For back-branches, I suspect we'll want to remove all uses of
durable_rename_excl() but leave the function around for any extensions that
are using it. Of course, we'd also need a big comment imploring folks not
to add any more callers. Another option would be to change the behavior of
durable_rename_excl() to something that we think is safer (e.g., stat then
rename), but that might just introduce a different set of problems.
[0]: /messages/by-id/20220408194345.GA1541826@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Michael Paquier <michael@paquier.xyz> writes:
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote:
I wonder if this is really true. I thought rename() was supposed to be atomic.
Not always. For example, some old versions of MacOS have a non-atomic
implementation of rename(), like prairiedog with 10.4. Even 10.5 does
not handle atomicity as far as I call.
I think that's not talking about the same thing. POSIX requires rename(2)
to replace an existing target link atomically:
If the link named by the new argument exists, it shall be removed and
old renamed to new. In this case, a link named new shall remain
visible to other threads throughout the renaming operation and refer
either to the file referred to by new or old before the operation
began.
(It's that requirement that ancient macOS fails to meet.)
However, I do not see any text that addresses the question of whether
the old link disappears atomically with the appearance of the new link,
and it seems like that'd be pretty impractical to ensure in cases like
moving a link from one directory to another. (What would it even mean
to say that, considering that a thread can't read the two directories
at the same instant?) From a crash-safety standpoint, it'd surely be
better to make the new link before removing the old, so I imagine
that's what most file systems do.
regards, tom lane
The readdir interface allows processes to be in the middle of reading
a directory and unless a kernel was happy to either materialize the
entire directory list when the readdir starts, or lock the entire
directory against modification for the entire time the a process has a
readdir fd open it's always going to be possible for the a process to
have previously read the old directory entry and later see the new
directory entry. Kernels don't do any MVCC or cmin type of games so
they're not going to be able to prevent it.
What's worse of course is that it may only happen in very large
directories. Most directories fit on a single block and readdir may
buffer up all the entries a block at a time for efficiency. So it may
only be visible on very large directories that span multiple blocks.
Here is an attempt at creating something that can be back-patched. 0001
simply replaces calls to durable_rename_excl() with durable_rename() and is
intended to be back-patched. 0002 removes the definition of
durable_rename_excl() and is _not_ intended for back-patching. I imagine
0002 will need to be held back for v16devel.
I think back-patching 0001 will encounter a couple of small obstacles. For
example, the call in basic_archive won't exist on most of the
back-branches, and durable_rename_excl() was named durable_link_or_rename()
before v13. I don't mind producing a patch for each back-branch if needed.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Apr 26, 2022 at 01:09:35PM -0700, Nathan Bossart wrote:
Here is an attempt at creating something that can be back-patched. 0001
simply replaces calls to durable_rename_excl() with durable_rename() and is
intended to be back-patched. 0002 removes the definition of
durable_rename_excl() and is _not_ intended for back-patching. I imagine
0002 will need to be held back for v16devel.
I would not mind applying 0002 on HEAD now to avoid more uses of this
API, and I can get behind 0001 after thinking more about it.
I think back-patching 0001 will encounter a couple of small obstacles. For
example, the call in basic_archive won't exist on most of the
back-branches, and durable_rename_excl() was named durable_link_or_rename()
before v13. I don't mind producing a patch for each back-branch if needed.
I am not sure that have any need to backpatch this change based on the
unlikeliness of the problem, TBH. One thing that is itching me a bit,
like Robert upthread, is that we don't check anymore that the newfile
does not exist in the code paths because we never expect one. It is
possible to use stat() for that. But access() within a simple
assertion would be simpler? Say something like:
Assert(access(path, F_OK) != 0 && errno == ENOENT);
The case for basic_archive is limited as the comment of the patch
states, but that would be helpful for the two calls in timeline.c and
the one in xlog.c in the long-term. And this has no need to be part
of fd.c, this can be added before the durable_rename() calls. What do
you think?
--
Michael
On Wed, Apr 27, 2022 at 04:09:20PM +0900, Michael Paquier wrote:
I am not sure that have any need to backpatch this change based on the
unlikeliness of the problem, TBH. One thing that is itching me a bit,
like Robert upthread, is that we don't check anymore that the newfile
does not exist in the code paths because we never expect one. It is
possible to use stat() for that. But access() within a simple
assertion would be simpler? Say something like:
Assert(access(path, F_OK) != 0 && errno == ENOENT);The case for basic_archive is limited as the comment of the patch
states, but that would be helpful for the two calls in timeline.c and
the one in xlog.c in the long-term. And this has no need to be part
of fd.c, this can be added before the durable_rename() calls. What do
you think?
Here is a new patch set with these assertions added. I think at least the
xlog.c change ought to be back-patched. The problem may be unlikely, but
AFAICT the possible consequences include WAL corruption.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com