silent data loss with ext4 / all current versions
Hi,
I did some power-failure tests (i.e. unexpectedly interrupting power)
a few days ago, and I've discovered a fairly serious case of silent
data loss on ext3/ext4. Initially I thought it was a filesystem bug,
but after further investigation I'm pretty sure it's our fault.
What happens is that when we recycle WAL segments, we rename them and
then sync them using fdatasync (which is the default on Linux). However,
fdatasync does not force an fsync on the parent directory, so in case of
power failure the rename may get lost. Recovery won't realize those
segments actually contain changes from the "future" and thus does not
replay them. Hence data loss. The recovery completes as if everything
went OK, so the data loss is entirely silent.
Reproducing this is rather trivial. I've prepared a simple C program
simulating our WAL recycling, which I originally intended to send to the
ext4 mailing list to demonstrate the ext4 bug, before I realized it's
most likely our bug and not theirs.
The example program is called ext4-data-loss.c and is available here
(along with other stuff mentioned in this message):
https://github.com/2ndQuadrant/ext4-data-loss
Compile it, run it (over ssh from another host), interrupt the power,
and after restart you should see some of the segments get lost (the
rename reverted).
The git repo also contains a bunch of python scripts that I initially
used to reproduce this on PostgreSQL - insert.py, update.py and
xlog-watch.py. I'm not going to explain the details here - it's a bit
more complicated - but the cause is exactly the same as with the C
program, just demonstrated in the database. See the README for
instructions.
So, what's going on? The problem is that while the rename() is atomic,
it's not guaranteed to be durable without an explicit fsync on the
parent directory. And by default we only do fdatasync on the recycled
segments, which may not force fsync on the directory (and ext4 does not
do that, apparently).
This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and
4.4-rc1), and also all supported PostgreSQL versions (tested on 9.1.19,
but I believe all versions since spread checkpoints were introduced are
vulnerable).
FWIW this has nothing to do with storage reliability - you may have good
drives, RAID controller with BBU, reliable SSDs or whatever, and you're
still not safe. This issue is at the filesystem level, not storage.
I've done the same tests on xfs and that seems to be safe - I've been
unable to reproduce the issue, so either the issue is not there or it's
more difficult to hit. I haven't tried other file systems, because ext4
and xfs cover the vast majority of deployments (at least on Linux), and
thus an issue on ext4 alone is serious enough, I believe.
It's possible to make ext3/ext4 safe with respect to this issue by using
full journaling (data=journal) instead of the default (data=ordered)
mode. However this comes at a significant performance cost and pretty
much no one is using it with PostgreSQL because data=ordered is believed
to be safe.
It's also possible to mitigate this by setting wal_sync_method=fsync,
but I don't think I've ever seen that changed at a customer site. It
also comes with a significant performance penalty, comparable to
data=journal, but has the advantage that it can be applied without
restarting the database (SIGHUP is enough).
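For reference, that mitigation is a one-line configuration change (a sketch; weigh the performance penalty mentioned above before using it):

```
# postgresql.conf
wal_sync_method = fsync        # default on Linux is fdatasync
```

A reload (e.g. `pg_ctl reload`, which sends SIGHUP) is enough to apply it; no restart is needed.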
So pretty much everyone running on Linux + ext3/ext4 is vulnerable.
It's also worth mentioning that the data is not actually lost - it's
properly fsynced in the WAL segments, it's just the rename that got
lost. So it's possible to survive this without losing data by manually
renaming the segments, but this must happen before starting the cluster
because the automatic recovery comes and discards all the data etc.
I think this might also result in various other problems, not just data
loss. For example, I wouldn't be surprised by data corruption due to
flushing some of the changes in data files to disk (due to contention
for shared buffers and reaching vm.dirty_bytes) and then losing the
matching WAL segment. Also, while I have only seen 1 to 3 segments
getting lost, it's possible that more segments could get lost, making
recovery impossible. And of course, this might cause problems with WAL
archiving, due to archiving the same segment twice (before and after
the crash).
Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty
sure this needs to be backpatched to all backbranches. I've also
attached a patch that adds pg_current_xlog_flush_location() function,
which proved to be quite useful when debugging this issue.
I'd also like to propose adding "last segment" to pg_controldata, next
to the last checkpoint / restartpoint. We don't need to write this on
every commit, once per segment (on the first write) is enough. This
would make investigating the issue much easier, and it'd also make it
possible to terminate the recovery with an error if the last found
segment does not match the expectation (instead of just assuming we've
found all segments, leading to data loss).
Another useful change would be to allow pg_xlogdump to print segments
even if the contents do not match the filename. Currently it's
impossible to even look at the contents in that case, so renaming the
existing segments is mostly guesswork (find segments where pg_xlogdump
fails, try renaming them to the next segments).
And finally, I've done a quick review of all the places that might
suffer from the same issue - some are not really interesting as the
stuff is ephemeral anyway (pgstat, for example), but there are ~15
places that may need this fix:
* src/backend/access/transam/timeline.c (2 matches)
* src/backend/access/transam/xlog.c (9 matches)
* src/backend/access/transam/xlogarchive.c (3 matches)
* src/backend/postmaster/pgarch.c (1 match)
Some of these places might actually be safe because an fsync happens
somewhere immediately after the rename (e.g. in a caller), but I guess
better safe than sorry.
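To make the needed change concrete, each of those fixes amounts to roughly the following (a sketch with a hypothetical name - the actual patch builds on the existing fsync_fname() routine): after rename(), open and fsync the parent directory so the new directory entry itself reaches stable storage.

```c
#include <assert.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: rename, then fsync the parent directory of the
 * new name, so the rename itself survives a power failure. */
int rename_durable(const char *oldpath, const char *newpath)
{
    char    pathcopy[1024];
    int     dirfd;
    int     rc;

    if (rename(oldpath, newpath) != 0)
        return -1;

    /* dirname() may scribble on its argument, so work on a copy */
    strncpy(pathcopy, newpath, sizeof(pathcopy) - 1);
    pathcopy[sizeof(pathcopy) - 1] = '\0';

    dirfd = open(dirname(pathcopy), O_RDONLY);
    if (dirfd < 0)
        return -1;

    rc = fsync(dirfd);          /* flush the directory entry itself */
    close(dirfd);
    return rc;
}
```

Fsyncing just the renamed file is not guaranteed to persist the rename; the extra fsync on the directory is what closes the window.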
I plan to do more power failure testing soon, with more complex test
scenarios. I suspect there might be other similar issues (e.g. when we
rename a file before a checkpoint and don't fsync the directory - then
the rename won't be replayed and will be lost).
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> What happens is that when we recycle WAL segments, we rename them and then sync
> them using fdatasync (which is the default on Linux). However fdatasync does not
> force fsync on the parent directory, so in case of power failure the rename may
> get lost. The recovery won't realize those segments actually contain changes

Agreed. I ran into this some time ago, although it wasn't with Postgres.

> So, what's going on? The problem is that while the rename() is atomic, it's not
> guaranteed to be durable without an explicit fsync on the parent directory. And
> by default we only do fdatasync on the recycled segments, which may not force
> fsync on the directory (and ext4 does not do that, apparently).
>
> This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and 4.4-rc1), and
> also all supported PostgreSQL versions (tested on 9.1.19, but I believe all
> versions since spread checkpoints were introduced are vulnerable).
>
> FWIW this has nothing to do with storage reliability - you may have good drives,
> RAID controller with BBU, reliable SSDs or whatever, and you're still not safe.
> This issue is at the filesystem level, not storage.

Agreed again.
> I plan to do more power failure testing soon, with more complex test scenarios.
> I suspect there might be other similar issues (e.g. when we rename a file before
> a checkpoint and don't fsync the directory - then the rename won't be replayed
> and will be lost).

It would be very useful, but I hope you won't find any new bugs :)
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So, what's going on? The problem is that while the rename() is atomic, it's
> not guaranteed to be durable without an explicit fsync on the parent
> directory. And by default we only do fdatasync on the recycled segments,
> which may not force fsync on the directory (and ext4 does not do that,
> apparently).
Yeah, that seems to be the way the POSIX spec clears things.
"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state. All I/O operations shall be completed as defined for
synchronized I/O file integrity completion."
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
If I understand that right, it is guaranteed that the rename() will be
atomic, meaning that there will be only one file even if there is a
crash, but that we need to fsync() the parent directory as mentioned.
> FWIW this has nothing to do with storage reliability - you may have good
> drives, RAID controller with BBU, reliable SSDs or whatever, and you're
> still not safe. This issue is at the filesystem level, not storage.
The POSIX spec authorizes this behavior, so the FS is not to blame,
clearly. At least that's what I get from it.
> I've done the same tests on xfs and that seems to be safe - I've been unable
> to reproduce the issue, so either the issue is not there or it's more
> difficult to hit it. I haven't tried on other file systems, because ext4 and
> xfs cover vast majority of deployments (at least on Linux), and thus issue
> on ext4 is serious enough I believe.
>
> So pretty much everyone running on Linux + ext3/ext4 is vulnerable.
>
> It's also worth mentioning that the data is not actually lost - it's
> properly fsynced in the WAL segments, it's just the rename that got lost. So
> it's possible to survive this without losing data by manually renaming the
> segments, but this must happen before starting the cluster because the
> automatic recovery comes and discards all the data etc.
Hm. Most users are not going to notice that, particularly where things
are embedded.
> I think this issue might also result in various other issues, not just data
> loss. For example, I wouldn't be surprised by data corruption due to
> flushing some of the changes in data files to disk (due to contention for
> shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL
> segment. Also, while I have only seen 1 to 3 segments getting lost, it might
> be possible that more segments can get lost, possibly making the recovery
> impossible. And of course, this might cause problems with WAL archiving due
> to archiving the same segment twice (before and after crash).
Possible, the switch to .done is done after renaming the segment in
xlogarchive.c. So this could happen in theory.
> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
> this needs to be backpatched to all backbranches. I've also attached a patch
> that adds pg_current_xlog_flush_location() function, which proved to be
> quite useful when debugging this issue.
Agreed. We should be sure as well that the calls to fsync_fname get
issued in a critical section with START/END_CRIT_SECTION(). It does
not seem to be the case with your patch.
> And finally, I've done a quick review of all places that might suffer the
> same issue - some are not really interesting as the stuff is ephemeral
> anyway (like pgstat for example), but there are ~15 places that may need
> this fix:
>
> * src/backend/access/transam/timeline.c (2 matches)
> * src/backend/access/transam/xlog.c (9 matches)
> * src/backend/access/transam/xlogarchive.c (3 matches)
> * src/backend/postmaster/pgarch.c (1 match)
>
> Some of these places might be actually safe because a fsync happens
> somewhere immediately after the rename (e.g. in a caller), but I guess
> better safe than sorry.
I had a quick look at those code paths and indeed it would be safer to
be sure that once rename() is called we issue those fsync calls.
> I plan to do more power failure testing soon, with more complex test
> scenarios. I suspect there might be other similar issues (e.g. when we
> rename a file before a checkpoint and don't fsync the directory - then the
> rename won't be replayed and will be lost).
That would be great.
--
Michael
On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I plan to do more power failure testing soon, with more complex test
> scenarios. I suspect there might be other similar issues (e.g. when we
> rename a file before a checkpoint and don't fsync the directory - then the
> rename won't be replayed and will be lost).
I'm curious how you're doing this testing. The easiest way I can think
of would be to run a database on an LVM volume and take a large number
of LVM snapshots very rapidly and then see if the database can start
up from each snapshot. Bonus points for keeping track of the committed
transactions before each snapshot and ensuring they're still there, I
guess.
That always seemed unsatisfactory because in the past we were mainly
concerned with whether fsync was actually getting propagated to the
physical media. But for testing whether we're fsyncing enough for the
filesystem that would be good enough.
--
greg
Hi,
On 11/27/2015 02:28 PM, Greg Stark wrote:
> On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I plan to do more power failure testing soon, with more complex
>> test scenarios. I suspect there might be other similar issues (e.g.
>> when we rename a file before a checkpoint and don't fsync the
>> directory - then the rename won't be replayed and will be lost).
>
> I'm curious how you're doing this testing. The easiest way I can
> think of would be to run a database on an LVM volume and take a large
> number of LVM snapshots very rapidly and then see if the database can
> start up from each snapshot. Bonus points for keeping track of the
> committed transactions before each snapshot and ensuring they're
> still there I guess.
I do have reliable storage (Intel SSD with power-loss protection), and
I've connected the system to a sophisticated power-loss-making device
called "the power switch" (image attached).
In other words, in the last ~7 days the system got rebooted more times
than in the previous ~5 years.
> That always seemed unsatisfactory because in the past we were mainly
> concerned with whether fsync was actually getting propagated to the
> physical media. But for testing whether we're fsyncing enough for
> the filesystem that would be good enough.
Yeah. I considered some form of virtualized setup initially, but my
original intent was to verify whether disabling write barriers really is
safe (because I've heard numerous complaints that it's stupid). And as
write barriers are more tightly coupled to the hardware, I opted for the
more brutal approach.

But I agree some form of virtualized setup might be more flexible,
although I'm not sure LVM snapshots are a good approach, as snapshots
may wait for I/O requests to complete and such. I think something like
qemu might work better when combined with "kill -9", and I plan to try
reproducing the data loss issue on such a setup.
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
power-failure-device.jpg (image/jpeg)
On 11/27/2015 02:18 PM, Michael Paquier wrote:
> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So, what's going on? The problem is that while the rename() is atomic, it's
>> not guaranteed to be durable without an explicit fsync on the parent
>> directory. And by default we only do fdatasync on the recycled segments,
>> which may not force fsync on the directory (and ext4 does not do that,
>> apparently).
>
> Yeah, that seems to be the way the POSIX spec clears things.
>
> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
> force all currently queued I/O operations associated with the file
> indicated by file descriptor fildes to the synchronized I/O completion
> state. All I/O operations shall be completed as defined for
> synchronized I/O file integrity completion."
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
>
> If I understand that right, it is guaranteed that the rename() will be
> atomic, meaning that there will be only one file even if there is a
> crash, but that we need to fsync() the parent directory as mentioned.
>
>> FWIW this has nothing to do with storage reliability - you may have good
>> drives, RAID controller with BBU, reliable SSDs or whatever, and you're
>> still not safe. This issue is at the filesystem level, not storage.
>
> The POSIX spec authorizes this behavior, so the FS is not to blame,
> clearly. At least that's what I get from it.
The spec seems a bit vague to me (but maybe it's not - I'm not a POSIX
expert), but we should be prepared for the less favorable
interpretation, I think.
>> I think this issue might also result in various other issues, not just data
>> loss. For example, I wouldn't be surprised by data corruption due to
>> flushing some of the changes in data files to disk (due to contention for
>> shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL
>> segment. Also, while I have only seen 1 to 3 segments getting lost, it might
>> be possible that more segments can get lost, possibly making the recovery
>> impossible. And of course, this might cause problems with WAL archiving due
>> to archiving the same segment twice (before and after crash).
>
> Possible, the switch to .done is done after renaming the segment in
> xlogarchive.c. So this could happen in theory.
Yes. That's one of the suspicious places in my notes (haven't posted all
the details, the message was long enough already).
>> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
>> this needs to be backpatched to all backbranches. I've also attached a patch
>> that adds pg_current_xlog_flush_location() function, which proved to be
>> quite useful when debugging this issue.
>
> Agreed. We should be sure as well that the calls to fsync_fname get
> issued in a critical section with START/END_CRIT_SECTION(). It does
> not seem to be the case with your patch.
Don't know. I've based that on the code in replication/logical/, which
does fsync_fname() in all the interesting places without a critical
section.
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Nov 28, 2015 at 3:01 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 11/27/2015 02:18 PM, Michael Paquier wrote:
>> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> So, what's going on? The problem is that while the rename() is atomic,
>>> it's not guaranteed to be durable without an explicit fsync on the parent
>>> directory. And by default we only do fdatasync on the recycled segments,
>>> which may not force fsync on the directory (and ext4 does not do that,
>>> apparently).
>>
>> Yeah, that seems to be the way the POSIX spec clears things.
>>
>> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
>> force all currently queued I/O operations associated with the file
>> indicated by file descriptor fildes to the synchronized I/O completion
>> state. All I/O operations shall be completed as defined for
>> synchronized I/O file integrity completion."
>> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
>>
>> If I understand that right, it is guaranteed that the rename() will be
>> atomic, meaning that there will be only one file even if there is a
>> crash, but that we need to fsync() the parent directory as mentioned.
>>
>>> FWIW this has nothing to do with storage reliability - you may have good
>>> drives, RAID controller with BBU, reliable SSDs or whatever, and you're
>>> still not safe. This issue is at the filesystem level, not storage.
>>
>> The POSIX spec authorizes this behavior, so the FS is not to blame,
>> clearly. At least that's what I get from it.
>
> The spec seems a bit vague to me (but maybe it's not, I'm not a POSIX
> expert),
As I am understanding it, FS implementations are free to decide to
make the rename persist on disk or not.
> but we should be prepared for the less favorable interpretation I
> think.
Yep. I agree. And in case my previous words were not clear, that's the
same line of thought here, we had better cover our backs and study
carefully each code path that could be impacted.
>>> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
>>> this needs to be backpatched to all backbranches. I've also attached a patch
>>> that adds pg_current_xlog_flush_location() function, which proved to be
>>> quite useful when debugging this issue.
>>
>> Agreed. We should be sure as well that the calls to fsync_fname get
>> issued in a critical section with START/END_CRIT_SECTION(). It does
>> not seem to be the case with your patch.
>
> Don't know. I've based that on code from replication/logical/ which does
> fsync_fname() on all the interesting places, without the critical section.
For slot information in slot.c, there will be a PANIC when fsyncing
pg_replslot at some points. It does not seem that weird to do the same,
for example, after renaming the backup label file...
--
Michael
On 27 November 2015 at 21:28, Greg Stark <stark@mit.edu> wrote:
> On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I plan to do more power failure testing soon, with more complex test
>> scenarios. I suspect there might be other similar issues (e.g. when we
>> rename a file before a checkpoint and don't fsync the directory - then the
>> rename won't be replayed and will be lost).
>
> I'm curious how you're doing this testing. The easiest way I can think
> of would be to run a database on an LVM volume and take a large number
> of LVM snapshots very rapidly and then see if the database can start
> up from each snapshot. Bonus points for keeping track of the committed
> transactions before each snapshot and ensuring they're still there I
> guess.
I've had a few tries at implementing a qemu-based crashtester where it hard
kills the qemu instance at a random point then starts it back up.
I always got stuck on the validation part - actually ensuring that the DB
state is how we expect. I think I could probably get that right now, it's
been a while.
The VM can be started back up and killed again over and over quite quickly.
It's not as good as physical plug-pull, but it's a lot more practical.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 27 November 2015 at 19:17, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
> It's also possible to mitigate this by setting wal_sync_method=fsync
Are you sure?
https://lwn.net/Articles/322823/ tends to suggest that fsync() on the file
is insufficient to ensure rename() is persistent, though it's somewhat old.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 11/29/2015 02:38 PM, Craig Ringer wrote:
> On 27 November 2015 at 21:28, Greg Stark <stark@mit.edu> wrote:
>> On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> I plan to do more power failure testing soon, with more complex test
>>> scenarios. I suspect there might be other similar issues (e.g. when we
>>> rename a file before a checkpoint and don't fsync the directory - then the
>>> rename won't be replayed and will be lost).
>>
>> I'm curious how you're doing this testing. The easiest way I can think
>> of would be to run a database on an LVM volume and take a large number
>> of LVM snapshots very rapidly and then see if the database can start
>> up from each snapshot. Bonus points for keeping track of the committed
>> transactions before each snapshot and ensuring they're still there I
>> guess.
>
> I've had a few tries at implementing a qemu-based crashtester where it
> hard kills the qemu instance at a random point then starts it back up.
I've tried to reproduce the issue by killing a qemu VM, and so far I've
been unsuccessful. On bare HW it was easily reproducible (I'd hit the
issue 9 out of 10 attempts), so either I'm doing something wrong or qemu
somehow interacts with the I/O.
> I always got stuck on the validation part - actually ensuring that the
> DB state is how we expect. I think I could probably get that right now,
> it's been a while.
Well, we can't really check all the details, but I guess the checksums
make checking the general consistency somewhat simpler. And then you
have to design the workload in a way that makes the check easier - for
example, by remembering the committed values etc.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/29/2015 02:41 PM, Craig Ringer wrote:
> On 27 November 2015 at 19:17, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> It's also possible to mitigate this by setting wal_sync_method=fsync
> Are you sure?
>
> https://lwn.net/Articles/322823/ tends to suggest that fsync() on the
> file is insufficient to ensure rename() is persistent, though it's
> somewhat old.
Good point. I don't know, and I'm not any smarter after reading the LWN
article. What I meant by "mitigate" is that I've been unable to
reproduce the issue after setting wal_sync_method=fsync, so my
conclusion is that it either fixes the issue or at least significantly
reduces the probability of hitting it.
It's pretty clear that the right fix is the additional fsync on pg_xlog.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/29/2015 03:33 PM, Tomas Vondra wrote:
> Hi,
>
> On 11/29/2015 02:38 PM, Craig Ringer wrote:
>> I've had a few tries at implementing a qemu-based crashtester where it
>> hard kills the qemu instance at a random point then starts it back up.
>
> I've tried to reproduce the issue by killing a qemu VM, and so far I've
> been unsuccessful. On bare HW it was easily reproducible (I'd hit the
> issue 9 out of 10 attempts), so either I'm doing something wrong or qemu
> somehow interacts with the I/O.
Update: I've managed to reproduce the issue in the qemu setup - I think
it needs slightly different timing because the VM is a bit slower. I
also tweaked vm.dirty_bytes and vm.dirty_background_bytes to the values
used on the bare hardware (I suspect that widens the window).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/27/15 8:18 AM, Michael Paquier wrote:
> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So, what's going on? The problem is that while the rename() is atomic, it's
>> not guaranteed to be durable without an explicit fsync on the parent
>> directory. And by default we only do fdatasync on the recycled segments,
>> which may not force fsync on the directory (and ext4 does not do that,
>> apparently).
>
> Yeah, that seems to be the way the POSIX spec clears things.
>
> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
> force all currently queued I/O operations associated with the file
> indicated by file descriptor fildes to the synchronized I/O completion
> state. All I/O operations shall be completed as defined for
> synchronized I/O file integrity completion."
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
>
> If I understand that right, it is guaranteed that the rename() will be
> atomic, meaning that there will be only one file even if there is a
> crash, but that we need to fsync() the parent directory as mentioned.
I don't see anywhere in the spec that a rename needs an fsync of the
directory to be durable. I can see why that would be needed in
practice, though. File system developers would probably be able to give
a more definite answer.
On 12/01/2015 10:44 PM, Peter Eisentraut wrote:
> On 11/27/15 8:18 AM, Michael Paquier wrote:
>> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> So, what's going on? The problem is that while the rename() is atomic, it's
>>> not guaranteed to be durable without an explicit fsync on the parent
>>> directory. And by default we only do fdatasync on the recycled segments,
>>> which may not force fsync on the directory (and ext4 does not do that,
>>> apparently).
>>
>> Yeah, that seems to be the way the POSIX spec clears things.
>>
>> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
>> force all currently queued I/O operations associated with the file
>> indicated by file descriptor fildes to the synchronized I/O completion
>> state. All I/O operations shall be completed as defined for
>> synchronized I/O file integrity completion."
>> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
>>
>> If I understand that right, it is guaranteed that the rename() will be
>> atomic, meaning that there will be only one file even if there is a
>> crash, but that we need to fsync() the parent directory as mentioned.
>
> I don't see anywhere in the spec that a rename needs an fsync of the
> directory to be durable. I can see why that would be needed in
> practice, though. File system developers would probably be able to
> give a more definite answer.
Yeah, POSIX is the lowest common denominator. In this case the spec
seems not to require this durability guarantee (a rename being durable
without an fsync on the directory), which means a filesystem behaving
this way is still POSIX-compliant.

At least that's my conclusion from reading https://lwn.net/Articles/322823/

However, as I explained in the original post, it's more complicated than
that - this only seems to be a problem with fdatasync. I've been unable
to reproduce the issue with wal_sync_method=fsync.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attached is v2 of the patch, that
(a) adds explicit fsync on the parent directory after all the rename()
calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
(b) adds START/END_CRIT_SECTION around the new fsync_fname calls
(except for those in timeline.c, as the START/END_CRIT_SECTION is
not available there)
The patch is fairly trivial and I've done some rudimentary testing, but
I'm sure I haven't exercised all the modified paths.
regards
Tomas
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
xlog-fsync-v2.patch (text/x-diff)
On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
[quoted text snipped]
I would like to have an in-depth look at that after finishing the
current CF, I am the manager of this one after all... Could you
register it in the 2016-01 CF for the time being? I don't mind being
beaten by someone else if this someone has some room to look at this
patch.
--
Michael
On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
[quoted text snipped]
And please feel free to add my name as reviewer.
--
Michael
On Wed, Dec 2, 2015 at 3:24 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
[quoted text snipped]
Tomas, I am planning to have a look at that, because it seems to be
important. In case it becomes lost on my radar, do you mind if I add
it to the 2016-03 CF?
--
Michael
On 01/19/2016 07:44 AM, Michael Paquier wrote:
[quoted text snipped]
Tomas, I am planning to have a look at that, because it seems to be
important. In case it becomes lost on my radar, do you mind if I add
it to the 2016-03 CF?
Well, what else can I do? I have to admit I'm quite surprised this is
still rotting here, considering it addresses a rather serious data loss
/ corruption issue on a pretty common setup.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
[quoted text snipped]
Well, what else can I do? I have to admit I'm quite surprised this is
still rotting here, considering it addresses a rather serious data loss
/ corruption issue on a pretty common setup.
Well, I think you did what you could. And we need to be sure now that
it gets in and that this patch gets a serious look. So for now my
guess is that not losing track of it would be a good first move. I
have added it here to attract more attention:
https://commitfest.postgresql.org/9/484/
--
Michael