PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Hi all
Some time ago I ran into an issue where a user encountered data corruption
after a storage error. PostgreSQL played a part in that corruption by
allowing a checkpoint to complete despite what should've been a fatal error.
TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
least on Linux. When fsync() returns success it means "all writes since the
last fsync have hit disk" but we assume it means "all writes since the last
SUCCESSFUL fsync have hit disk".
Pg wrote some blocks, which went to OS dirty buffers for writeback.
Writeback failed due to an underlying storage error. The block I/O layer
and XFS marked the writeback page as failed (AS_EIO), but had no way to
tell the app about the failure. When Pg called fsync() on the FD during the
next checkpoint, fsync() returned EIO because of the flagged page, to tell
Pg that a previous async write failed. Pg treated the checkpoint as failed
and didn't advance the redo start position in the control file.
All good so far.
But then we retried the checkpoint, which retried the fsync(). The retry
succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.
The write never made it to disk, but we completed the checkpoint, and
merrily carried on our way. Whoops, data loss.
The clear-error-and-continue behaviour of fsync is not documented as far as
I can tell. Nor is fsync() returning EIO unless you have a very new Linux
man-pages with the patch I wrote to add it. But from what I can see in the
POSIX standard we are not given any guarantees about what happens on
fsync() failure at all, so we're probably wrong to assume that retrying
fsync() is safe.
If the server had been using ext3 or ext4 with errors=remount-ro, the
problem wouldn't have occurred because the first I/O error would've
remounted the FS and stopped Pg from continuing. But XFS doesn't have that
option. There may be other situations where this can occur too, involving
LVM and/or multipath, but I haven't comprehensively dug out the details yet.
It proved possible to recover the system by faking up a backup label from
before the first incorrectly-successful checkpoint, forcing redo to repeat
and write the lost blocks. But ... what a mess.
I posted about the underlying fsync issue here some time ago:
https://stackoverflow.com/q/42434872/398670
but haven't had a chance to follow up about the Pg specifics.
I've been looking at the problem on and off and haven't come up with a good
answer. I think we should just PANIC and let redo sort it out by repeating
the failed write when it repeats work since the last checkpoint.
The API offered by async buffered writes and fsync offers us no way to find
out which page failed, so we can't just selectively redo that write. I
think we do know the relfilenode associated with the fd that failed to
fsync, but not much more. So the alternative seems to be some sort of
potentially complex online-redo scheme where we replay WAL for only the
relation on which we had the fsync() error, while otherwise servicing
queries normally. That's likely to be extremely error-prone and hard to
test, and it's trying to solve a case where on other filesystems the whole
DB would grind to a halt anyway.
I looked into whether we can solve it with use of the AIO API instead, but
the mess is even worse there - from what I can tell you can't even reliably
guarantee fsync at all on all Linux kernel versions.
We already PANIC on fsync() failure for WAL segments. We just need to do
the same for data forks at least for EIO. This isn't as bad as it seems
because AFAICS fsync only returns EIO in cases where we should be stopping
the world anyway, and many FSes will do that for us.
There are rather a lot of pg_fsync() callers. While we could handle this
case-by-case for each one, I'm tempted to just make pg_fsync() itself
intercept EIO and PANIC. Thoughts?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Retrying fsync() is not OK at
least on Linux. When fsync() returns success it means "all writes since the
last fsync have hit disk" but we assume it means "all writes since the last
SUCCESSFUL fsync have hit disk".
If that's actually the case, we need to push back on this kernel brain
damage, because as you're describing it fsync would be completely useless.
Moreover, POSIX is entirely clear that successful fsync means all
preceding writes for the file have been completed, full stop, doesn't
matter when they were issued.
regards, tom lane
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
--
Michael
On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
Craig, is the phenomenon you described the same as the second issue
"Reporting writeback errors" discussed in this article?
https://lwn.net/Articles/724307/
"Current kernels might report a writeback error on an fsync() call,
but there are a number of ways in which that can fail to happen."
That's... I'm speechless.
--
Thomas Munro
http://www.enterprisedb.com
On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
The retries are the source of the problem; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.
(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).
On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:
Craig, is the phenomenon you described the same as the second issue
"Reporting writeback errors" discussed in this article?
https://lwn.net/Articles/724307/
Worse, the article acknowledges the behavior without apparently suggesting to
change it:
"Storing that value in the file structure has an important benefit: it makes
it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT
CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR
OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."
I believe I reproduced the problem behavior using dmsetup "error" target, see
attached.
strace looks like this:
kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
1 open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
2 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
3 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
4 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
5 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
6 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
7 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
8 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 2560
9 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = -1 ENOSPC (No space left on device)
10 dup(2) = 4
11 fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
12 brk(NULL) = 0x1299000
13 brk(0x12ba000) = 0x12ba000
14 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
15 write(4, "write(1): No space left on devic"..., 34write(1): No space left on device
16 ) = 34
17 close(4) = 0
18 fsync(3) = -1 EIO (Input/output error)
19 dup(2) = 4
20 fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
21 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
22 write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output error
23 ) = 29
24 close(4) = 0
25 close(3) = 0
26 open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
27 fsync(3) = 0
28 write(3, "\0", 1) = 1
29 fsync(3) = 0
30 exit_group(0) = ?
2: EIO isn't seen initially due to writeback page cache;
9: ENOSPC due to small device
18: original IO error reported by fsync, good
25: the original FD is closed
26: ..and file reopened
27: fsync on file with still-dirty data+EIO returns success BAD
10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror
should write to a separate FD (?)
Also note, close() ALSO returned success, which you might think exonerates the
2nd fsync(), but I think may itself be problematic, no? In any case, the 2nd
byte certainly never got written to DM error, and the failure status was lost
following fsync().
I get the exact same behavior if I break after one write() loop, such as to
avoid ENOSPC.
Justin
Attachments: eio.c (text/x-csrc)
On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
The retries are the source of the problem; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.
What I'm failing to grok here is how that error flag even matters,
whether it's a single bit or a counter as described in that patch. If
write back failed, *the page is still dirty*. So all future calls to
fsync() need to try to flush it again, and (presumably) fail
again (unless it happens to succeed this time around).
--
Thomas Munro
http://www.enterprisedb.com
On 29 March 2018 at 13:06, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
The retries are the source of the problem; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.
What I'm failing to grok here is how that error flag even matters,
whether it's a single bit or a counter as described in that patch. If
write back failed, *the page is still dirty*. So all future calls to
fsync() need to try to flush it again, and (presumably) fail
again (unless it happens to succeed this time around).
You'd think so. But it doesn't appear to work that way. You can see
yourself with the error device-mapper destination mapped over part of a
volume.
I wrote a test case here.
https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c
I don't pretend the kernel behaviour is sane. And it's possible I've made
an error in my analysis. But since I've observed this in the wild, and seen
it in a test case, I strongly suspect that's what I've described is just
what's happening, brain-dead or no.
Presumably the kernel marks the page clean when it dispatches it to the I/O
subsystem and doesn't dirty it again on I/O error? I haven't dug that deep
on the kernel side. See the stackoverflow post for details on what I found
in kernel code analysis.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 29 March 2018 at 10:48, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
Craig, is the phenomenon you described the same as the second issue
"Reporting writeback errors" discussed in this article?
A variant of it, by the looks.
The problem in our case is that the kernel only tells us about the error
once. It then forgets about it. So yes, that seems like a variant of the
statement:
"Current kernels might report a writeback error on an fsync() call,
but there are a number of ways in which that can fail to happen."That's... I'm speechless.
Yeah.
It's a bit nuts.
I was astonished when I saw the behaviour, and that it appears undocumented.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 29 March 2018 at 10:30, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
I covered this in my original post.
Yes, we check the return value. But what do we do about it? For fsyncs of
heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint
later, which will retry the fsync(). **Which will now appear to succeed**
because the kernel forgot that it lost our writes after telling us the
first time. So we do check the error code, which returns success, and we
complete the checkpoint and move on.
But we only retried the fsync, not the writes before the fsync.
So we lost data. Or rather, failed to detect that the kernel did so, so our
checkpoint was bad and could not be completed.
The problem is that we keep retrying checkpoints *without* repeating the
writes leading up to the checkpoint, and retrying fsync.
I don't pretend the kernel behaviour is sane, but we'd better deal with it
anyway.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28 March 2018 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
well to avoid similar lost-page-write issues.
It's not necessary on ext3/ext4 with errors=remount-ro, but that's only
because the FS stops us dead in our tracks.
I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going
to lose a write, it should at minimum mark the FD as broken so no further
fsync() or anything else can succeed on the FD, and an app that cares about
durability must repeat the whole set of work since the prior successful
fsync(). Just reporting it once and forgetting it is madness.
But even if we convince the kernel folks of that, how do other platforms
behave? And how long before these kernels are out of use? We'd better deal
with it, crazy or no.
Please see my StackOverflow post for the kernel-level explanation. Note
also the test case link there. https://stackoverflow.com/a/42436054/398670
Retrying fsync() is not OK at
least on Linux. When fsync() returns success it means "all writes since the
last fsync have hit disk" but we assume it means "all writes since the last
SUCCESSFUL fsync have hit disk".
If that's actually the case, we need to push back on this kernel brain
damage, because as you're describing it fsync would be completely useless.
It's not useless, it's just telling us something other than what we think
it means. The promise it seems to give us is that if it reports an error
once, everything *after* that is useless, so we should throw our toys,
close and reopen everything, and redo from the last known-good state.
Though as Thomas posted below, it provides rather weaker guarantees than I
thought in some other areas too. See that lwn.net article he linked.
Moreover, POSIX is entirely clear that successful fsync means all
preceding writes for the file have been completed, full stop, doesn't
matter when they were issued.
I can't find anything that says so to me. Please quote relevant spec.
I'm working from
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
states that
"The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage device
associated with the file described by fildes. The nature of the transfer is
implementation-defined. The fsync() function shall not return until the
system has completed that action or until an error is detected."
My reading is that POSIX does not specify what happens AFTER an error is
detected. It doesn't say that error has to be persistent and that
subsequent calls must also report the error. It also says:
"If the fsync() function fails, outstanding I/O operations are not
guaranteed to have been completed."
but that doesn't clarify matters much either, because it can be read to
mean that once there's been an error reported for some IO operations
there's no guarantee those operations are ever completed even after a
subsequent fsync returns success.
I'm not seeking to defend what the kernel seems to be doing. Rather, saying
that we might see similar behaviour on other platforms, crazy or not. I
haven't looked past linux yet, though.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 28 March 2018 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
well to avoid similar lost-page-write issues.
I found your discussion with kernel hacker Jeff Layton at
https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
writeup seems to want a scheme where pages stay dirty after a
writeback failure so that we can try to fsync them again. Note that
that has never been the case in Linux after hard writeback failures,
AFAIK, so programs should definitely not assume that behavior."
The article above that says the same thing a couple of different ways,
ie that writeback failure leaves you with pages that are neither
written to disk successfully nor marked dirty.
If I'm reading various articles correctly, the situation was even
worse before his errseq_t stuff landed. That fixed cases of
completely unreported writeback failures due to sharing of PG_error
for both writeback and read errors with certain filesystems, but it
doesn't address the clean pages problem.
Yeah, I see why you want to PANIC.
Moreover, POSIX is entirely clear that successful fsync means all
preceding writes for the file have been completed, full stop, doesn't
matter when they were issued.
I can't find anything that says so to me. Please quote relevant spec.
I'm working from
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
states that
"The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage device
associated with the file described by fildes. The nature of the transfer is
implementation-defined. The fsync() function shall not return until the
system has completed that action or until an error is detected."
My reading is that POSIX does not specify what happens AFTER an error is
detected. It doesn't say that error has to be persistent and that subsequent
calls must also report the error. It also says:
FWIW my reading is the same as Tom's. It says "all data for the open
file descriptor" without qualification or special treatment after
errors. Not "some".
I'm not seeking to defend what the kernel seems to be doing. Rather, saying
that we might see similar behaviour on other platforms, crazy or not. I
haven't looked past linux yet, though.
I see no reason to think that any other operating system would behave
that way without strong evidence... This is openly acknowledged to be
"a mess" and "a surprise" in the Filesystem Summit article. I am not
really qualified to comment, but from a cursory glance at FreeBSD's
vfs_bio.c I think it's doing what you'd hope for... see the code near
the comment "Failed write, redirty."
--
Thomas Munro
http://www.enterprisedb.com
On 29 March 2018 at 20:07, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 28 March 2018 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
well to avoid similar lost-page-write issues.
I found your discussion with kernel hacker Jeff Layton at
https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
writeup seems to want a scheme where pages stay dirty after a
writeback failure so that we can try to fsync them again. Note that
that has never been the case in Linux after hard writeback failures,
AFAIK, so programs should definitely not assume that behavior."
The article above that says the same thing a couple of different ways,
ie that writeback failure leaves you with pages that are neither
written to disk successfully nor marked dirty.
If I'm reading various articles correctly, the situation was even
worse before his errseq_t stuff landed. That fixed cases of
completely unreported writeback failures due to sharing of PG_error
for both writeback and read errors with certain filesystems, but it
doesn't address the clean pages problem.
Yeah, I see why you want to PANIC.
In more ways than one ;)
I'm not seeking to defend what the kernel seems to be doing. Rather, saying
that we might see similar behaviour on other platforms, crazy or not. I
haven't looked past linux yet, though.
I see no reason to think that any other operating system would behave
that way without strong evidence... This is openly acknowledged to be
"a mess" and "a surprise" in the Filesystem Summit article. I am not
really qualified to comment, but from a cursory glance at FreeBSD's
vfs_bio.c I think it's doing what you'd hope for... see the code near
the comment "Failed write, redirty."
Ok, that's reassuring, but doesn't help us on the platform the great
majority of users deploy on :(
"If on Linux, PANIC"
Hrm.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 29, 2018 at 2:07 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I found your discussion with kernel hacker Jeff Layton at
https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
writeup seems to want a scheme where pages stay dirty after a
writeback failure so that we can try to fsync them again. Note that
that has never been the case in Linux after hard writeback failures,
AFAIK, so programs should definitely not assume that behavior."
And a bit below in the same comments, to this question about PG: "So,
what are the options at this point? The assumption was that we can
repeat the fsync (which as you point out is not the case), or shut
down the database and perform recovery from WAL", the same Jeff Layton
seems to agree PANIC is the appropriate response:
"Replaying the WAL synchronously sounds like the simplest approach
when you get an error on fsync. These are uncommon occurrences for the
most part, so having to fall back to slow, synchronous error recovery
modes when this occurs is probably what you want to do.".
And right after, he confirms the errseq_t patches are about always
detecting this, not more:
"The main thing I working on is to better guarantee is that you
actually get an error when this occurs rather than silently corrupting
your data. The circumstances where that can occur require some
corner-cases, but I think we need to make sure that it doesn't occur."
Jeff's comments in the pull request that merged errseq_t are worth
reading as well:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750
The article above that says the same thing a couple of different ways,
ie that writeback failure leaves you with pages that are neither
written to disk successfully nor marked dirty.
If I'm reading various articles correctly, the situation was even
worse before his errseq_t stuff landed. That fixed cases of
completely unreported writeback failures due to sharing of PG_error
for both writeback and read errors with certain filesystems, but it
doesn't address the clean pages problem.
Indeed, that's exactly how I read it as well (opinion formed
independently before reading your sentence above). The errseq_t
patches landed in v4.13 by the way, so very recently.
Yeah, I see why you want to PANIC.
Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.
On Fri, Mar 30, 2018 at 5:20 AM, Catalin Iacob <iacobcatalin@gmail.com> wrote:
Jeff's comments in the pull request that merged errseq_t are worth
reading as well:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750
Wow. It looks like there may be a separate question of when each
filesystem adopted this new infrastructure?
Yeah, I see why you want to PANIC.
Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.
The pre-errseq_t problems are beyond our control. There's nothing we
can do about that in userspace (except perhaps abandon OS-buffered IO,
a big project). We just need to be aware that this problem exists in
certain kernel versions and be grateful to Layton for fixing it.
The dropped dirty flag problem is something we can and in my view
should do something about, whatever we might think about that design
choice. As Andrew Gierth pointed out to me in an off-list chat about
this, by the time you've reached this state, both PostgreSQL's buffer
and the kernel's buffer are clean and might be reused for another
block at any time, so your data might be gone from the known universe
-- we don't even have the option to rewrite our buffers in general.
Recovery is the only option.
Thank you to Craig for chasing this down and +1 for his proposal, on Linux only.
--
Thomas Munro
http://www.enterprisedb.com
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
Yeah, I see why you want to PANIC.
Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.
There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy whatsoever. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap()
the file in question, obtain the PFNs (via /proc/pid/pagemap)
and enumerate those to match the ones with the error flag switched
on (via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also further use this info to logically
map the pages that failed IO back to the original file offsets,
and potentially retry IO just for those file ranges that cover
the failed pages. Just an idea, not tested.
Best regards,
Anthony
On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop@altatus.com> wrote:
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
Yeah, I see why you want to PANIC.
Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.
There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy whatsoever. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap()
the file in question, obtain the PFNs (via /proc/pid/pagemap)
and enumerate those to match the ones with the error flag switched
on (via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also further use this info to logically
map the pages that failed IO back to the original file offsets,
and potentially retry IO just for those file ranges that cover
the failed pages. Just an idea, not tested.
That sounds like a huge amount of complexity, with uncertainty as to how
it'll behave kernel-to-kernel, for negligible benefit.
I was exploring the idea of doing selective recovery of one relfilenode,
based on the assumption that we know the filenode related to the fd that
failed to fsync(). We could redo only WAL on that relation. But it fails
the same test: it's too complex for a niche case that shouldn't happen in
the first place, so it'll probably have bugs, or grow bugs in bitrot over
time.
Remember, if you're on ext4 with errors=remount-ro, you get shut down even
harder than a PANIC. So we should just use the big hammer here.
I'll send a patch this week.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer <craig@2ndquadrant.com> writes:
So we should just use the big hammer here.
And bitch, loudly and publicly, about how broken this kernel behavior is.
If we make enough of a stink maybe it'll get fixed.
regards, tom lane
On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
So we should just use the big hammer here.
And bitch, loudly and publicly, about how broken this kernel behavior is.
If we make enough of a stink maybe it'll get fixed.
That won't fix anything released already, so as per the information
gathered something has to be done anyway. The discussion of this thread
is spreading quite a lot actually.
Handling things at a low-level looks like a better plan for the backend.
Tools like pg_basebackup and pg_dump also issue fsync's on the data
created, we should do an equivalent for them, with some exit() calls in
file_utils.c. As of now failures are logged to stderr but not
considered fatal.
--
Michael
On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:
On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop@altatus.com> wrote:
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
Yeah, I see why you want to PANIC.
Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.
There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy whatsoever. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap()
the file in question, obtain the PFNs (via /proc/pid/pagemap)
and enumerate those to match the ones with the error flag switched
on (via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also further use this info to logically
map the pages that failed IO back to the original file offsets,
and potentially retry IO just for those file ranges that cover
the failed pages. Just an idea, not tested.
That sounds like a huge amount of complexity, with uncertainty as to how
it'll behave kernel-to-kernel, for negligible benefit.
Those interfaces have been around since the kernel 2.6 times and are
rather stable, but I was merely responding to your original post comment
regarding having a way of finding out which page(s) failed. I assume
that indeed there would be no benefit, especially since those errors
are usually not transient (typically they come from hard medium faults),
and although a filesystem could theoretically mask the error by allocating
a different logical block, I am not aware of any implementation that
currently does that.
I was exploring the idea of doing selective recovery of one relfilenode,
based on the assumption that we know the filenode related to the fd that
failed to fsync(). We could redo only WAL on that relation. But it fails
the same test: it's too complex for a niche case that shouldn't happen in
the first place, so it'll probably have bugs, or grow bugs in bitrot over
time.
Fully agree, those cases should be sufficiently rare that a complex
and possibly non-maintainable solution is not really warranted.
Remember, if you're on ext4 with errors=remount-ro, you get shut down even
harder than a PANIC. So we should just use the big hammer here.
I am not entirely sure what you mean here, does Pg really treat write()
errors as fatal? Also, the kind of errors that ext4 detects with this
option is at the superblock level and govern metadata rather than actual
data writes (recall that those are buffered anyway, no actual device IO
has to take place at the time of write()).
Best regards,
Anthony
On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
So we should just use the big hammer here.
And bitch, loudly and publicly, about how broken this kernel behavior is.
If we make enough of a stink maybe it'll get fixed.
It is not likely to be fixed (beyond what has been done already with the
manpage patches and errseq_t fixes on the reporting level). The issue is,
the kernel needs to deal with hard IO errors at that level somehow, and
since those errors typically persist, re-dirtying the pages would not
really solve the problem (unless some filesystem remaps the request to a
different block, assuming the device is alive). Keeping around dirty
pages that cannot possibly be written out is essentially a memory leak,
as those pages would stay around even after the application has exited.
Best regards,
Anthony