PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Hi all
Some time ago I ran into an issue where a user encountered data corruption
after a storage error. PostgreSQL played a part in that corruption by
allowing a checkpoint to complete despite what should've been a fatal error.
TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
least on Linux. When fsync() returns success it means "all writes since the
last fsync have hit disk" but we assume it means "all writes since the last
SUCCESSFUL fsync have hit disk".
Pg wrote some blocks, which went to OS dirty buffers for writeback.
Writeback failed due to an underlying storage error. The block I/O layer
and XFS marked the writeback page as failed (AS_EIO), but had no way to
tell the app about the failure. When Pg called fsync() on the FD during the
next checkpoint, fsync() returned EIO because of the flagged page, to tell
Pg that a previous async write failed. Pg treated the checkpoint as failed
and didn't advance the redo start position in the control file.
All good so far.
But then we retried the checkpoint, which retried the fsync(). The retry
succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.
The write never made it to disk, but we completed the checkpoint, and
merrily carried on our way. Whoops, data loss.
The clear-error-and-continue behaviour of fsync is not documented as far as
I can tell. Nor is fsync() returning EIO unless you have a very new Linux
man-pages with the patch I wrote to add it. But from what I can see in the
POSIX standard we are not given any guarantees about what happens on
fsync() failure at all, so we're probably wrong to assume that retrying
fsync() is safe.
If the server had been using ext3 or ext4 with errors=remount-ro, the
problem wouldn't have occurred because the first I/O error would've
remounted the FS and stopped Pg from continuing. But XFS doesn't have that
option. There may be other situations where this can occur too, involving
LVM and/or multipath, but I haven't comprehensively dug out the details yet.
It proved possible to recover the system by faking up a backup label from
before the first incorrectly-successful checkpoint, forcing redo to repeat
and write the lost blocks. But ... what a mess.
I posted about the underlying fsync issue here some time ago:
https://stackoverflow.com/q/42434872/398670
but haven't had a chance to follow up about the Pg specifics.
I've been looking at the problem on and off and haven't come up with a good
answer. I think we should just PANIC and let redo sort it out by repeating
the failed write when it repeats work since the last checkpoint.
The API offered by async buffered writes and fsync offers us no way to find
out which page failed, so we can't just selectively redo that write. I
think we do know the relfilenode associated with the fd that failed to
fsync, but not much more. So the alternative seems to be some sort of
potentially complex online-redo scheme where we replay WAL for only the
relation on which we had the fsync() error, while otherwise servicing
queries normally. That's likely to be extremely error-prone and hard to
test, and it's trying to solve a case where on other filesystems the whole
DB would grind to a halt anyway.
I looked into whether we can solve it with use of the AIO API instead, but
the mess is even worse there - from what I can tell you can't even reliably
guarantee fsync at all on all Linux kernel versions.
We already PANIC on fsync() failure for WAL segments. We just need to do
the same for data forks at least for EIO. This isn't as bad as it seems
because AFAICS fsync only returns EIO in cases where we should be stopping
the world anyway, and many FSes will do that for us.
There are rather a lot of pg_fsync() callers. While we could handle this
case-by-case for each one, I'm tempted to just make pg_fsync() itself
intercept EIO and PANIC. Thoughts?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Retrying fsync() is not OK at
least on Linux. When fsync() returns success it means "all writes since the
last fsync have hit disk" but we assume it means "all writes since the last
SUCCESSFUL fsync have hit disk".
If that's actually the case, we need to push back on this kernel brain
damage, because as you're describing it fsync would be completely useless.
Moreover, POSIX is entirely clear that successful fsync means all
preceding writes for the file have been completed, full stop, doesn't
matter when they were issued.
regards, tom lane
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
--
Michael
On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
Craig, is the phenomenon you described the same as the second issue
"Reporting writeback errors" discussed in this article?
https://lwn.net/Articles/724307/
"Current kernels might report a writeback error on an fsync() call,
but there are a number of ways in which that can fail to happen."
That's... I'm speechless.
--
Thomas Munro
http://www.enterprisedb.com
On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
The retries are the source of the problem; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.
(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).
On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:
Craig, is the phenomenon you described the same as the second issue
"Reporting writeback errors" discussed in this article?
https://lwn.net/Articles/724307/
Worse, the article acknowledges the behavior without apparently suggesting to
change it:
"Storing that value in the file structure has an important benefit: it makes
it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT
CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR
OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."
I believe I reproduced the problem behavior using dmsetup "error" target, see
attached.
strace looks like this:
kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
1 open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
2 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
3 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
4 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
5 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
6 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
7 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
8 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 2560
9 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = -1 ENOSPC (No space left on device)
10 dup(2) = 4
11 fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
12 brk(NULL) = 0x1299000
13 brk(0x12ba000) = 0x12ba000
14 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
15 write(4, "write(1): No space left on devic"..., 34write(1): No space left on device
16 ) = 34
17 close(4) = 0
18 fsync(3) = -1 EIO (Input/output error)
19 dup(2) = 4
20 fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
21 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
22 write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output error
23 ) = 29
24 close(4) = 0
25 close(3) = 0
26 open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
27 fsync(3) = 0
28 write(3, "\0", 1) = 1
29 fsync(3) = 0
30 exit_group(0) = ?
2: EIO isn't seen initially due to writeback page cache;
9: ENOSPC due to small device
18: original IO error reported by fsync, good
25: the original FD is closed
26: ..and file reopened
27: fsync on file with still-dirty data+EIO returns success BAD
10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror
should write to a separate FD (?)
Also note, close() ALSO returned success, which you might think exonerates the
2nd fsync(), but I think may itself be problematic, no? In any case, the 2nd
byte certainly never got written to DM error, and the failure status was lost
following fsync().
I get the exact same behavior if I break after one write() loop, such as to
avoid ENOSPC.
Justin
Attachments: eio.c (text/x-csrc)
On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
The retries are the source of the problem; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.
What I'm failing to grok here is how that error flag even matters,
whether it's a single bit or a counter as described in that patch. If
write back failed, *the page is still dirty*. So all future calls to
fsync() need to try to flush it again, and (presumably) fail
again (unless it happens to succeed this time around).
--
Thomas Munro
http://www.enterprisedb.com
On 29 March 2018 at 13:06, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
The retries are the source of the problem; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.
What I'm failing to grok here is how that error flag even matters,
whether it's a single bit or a counter as described in that patch. If
write back failed, *the page is still dirty*. So all future calls to
fsync() need to try to flush it again, and (presumably) fail
again (unless it happens to succeed this time around).
You'd think so. But it doesn't appear to work that way. You can see
yourself with the error device-mapper destination mapped over part of a
volume.
I wrote a test case here.
https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c
I don't pretend the kernel behaviour is sane. And it's possible I've made
an error in my analysis. But since I've observed this in the wild, and seen
it in a test case, I strongly suspect that's what I've described is just
what's happening, brain-dead or no.
Presumably the kernel marks the page clean when it dispatches it to the I/O
subsystem and doesn't dirty it again on I/O error? I haven't dug that deep
on the kernel side. See the stackoverflow post for details on what I found
in kernel code analysis.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 29 March 2018 at 10:48, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
Craig, is the phenomenon you described the same as the second issue
"Reporting writeback errors" discussed in this article?
A variant of it, by the looks.
The problem in our case is that the kernel only tells us about the error
once. It then forgets about it. So yes, that seems like a variant of the
statement:
"Current kernels might report a writeback error on an fsync() call,
but there are a number of ways in which that can fail to happen."That's... I'm speechless.
Yeah.
It's a bit nuts.
I was astonished when I saw the behaviour, and that it appears undocumented.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 29 March 2018 at 10:30, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
I covered this in my original post.
Yes, we check the return value. But what do we do about it? For fsyncs of
heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint
later, which will retry the fsync(). **Which will now appear to succeed**
because the kernel forgot that it lost our writes after telling us the
first time. So we do check the error code, which returns success, and we
complete the checkpoint and move on.
But we only retried the fsync, not the writes before the fsync.
So we lost data. Or rather, failed to detect that the kernel did so, so our
checkpoint was bad and could not be completed.
The problem is that we keep retrying checkpoints *without* repeating the
writes leading up to the checkpoint, and retrying fsync.
I don't pretend the kernel behaviour is sane, but we'd better deal with it
anyway.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28 March 2018 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
well to avoid similar lost-page-write issues.
It's not necessary on ext3/ext4 with errors=remount-ro, but that's only
because the FS stops us dead in our tracks.
I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going
to lose a write, it should at minimum mark the FD as broken so no further
fsync() or anything else can succeed on the FD, and an app that cares about
durability must repeat the whole set of work since the prior successful
fsync(). Just reporting it once and forgetting it is madness.
But even if we convince the kernel folks of that, how do other platforms
behave? And how long before these kernels are out of use? We'd better deal
with it, crazy or no.
Please see my StackOverflow post for the kernel-level explanation. Note
also the test case link there. https://stackoverflow.com/a/42436054/398670
Retrying fsync() is not OK at
least on Linux. When fsync() returns success it means "all writes since the
last fsync have hit disk" but we assume it means "all writes since the last
SUCCESSFUL fsync have hit disk".
If that's actually the case, we need to push back on this kernel brain
damage, because as you're describing it fsync would be completely useless.
It's not useless, it's just telling us something other than what we think
it means. The promise it seems to give us is that if it reports an error
once, everything *after* that is useless, so we should throw our toys,
close and reopen everything, and redo from the last known-good state.
Though as Thomas posted below, it provides rather weaker guarantees than I
thought in some other areas too. See that lwn.net article he linked.
Moreover, POSIX is entirely clear that successful fsync means all
preceding writes for the file have been completed, full stop, doesn't
matter when they were issued.
I can't find anything that says so to me. Please quote relevant spec.
I'm working from
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
states that
"The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage device
associated with the file described by fildes. The nature of the transfer is
implementation-defined. The fsync() function shall not return until the
system has completed that action or until an error is detected."
My reading is that POSIX does not specify what happens AFTER an error is
detected. It doesn't say that error has to be persistent and that
subsequent calls must also report the error. It also says:
"If the fsync() function fails, outstanding I/O operations are not
guaranteed to have been completed."
but that doesn't clarify matters much either, because it can be read to
mean that once there's been an error reported for some IO operations
there's no guarantee those operations are ever completed even after a
subsequent fsync returns success.
I'm not seeking to defend what the kernel seems to be doing. Rather, saying
that we might see similar behaviour on other platforms, crazy or not. I
haven't looked past linux yet, though.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 28 March 2018 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
well to avoid similar lost-page-write issues.
I found your discussion with kernel hacker Jeff Layton at
https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
writeup seems to want a scheme where pages stay dirty after a
writeback failure so that we can try to fsync them again. Note that
that has never been the case in Linux after hard writeback failures,
AFAIK, so programs should definitely not assume that behavior."
The article above that says the same thing a couple of different ways,
ie that writeback failure leaves you with pages that are neither
written to disk successfully nor marked dirty.
If I'm reading various articles correctly, the situation was even
worse before his errseq_t stuff landed. That fixed cases of
completely unreported writeback failures due to sharing of PG_error
for both writeback and read errors with certain filesystems, but it
doesn't address the clean pages problem.
Yeah, I see why you want to PANIC.
Moreover, POSIX is entirely clear that successful fsync means all
preceding writes for the file have been completed, full stop, doesn't
matter when they were issued.
I can't find anything that says so to me. Please quote relevant spec.
I'm working from
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
states that
"The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage device
associated with the file described by fildes. The nature of the transfer is
implementation-defined. The fsync() function shall not return until the
system has completed that action or until an error is detected."
My reading is that POSIX does not specify what happens AFTER an error is
detected. It doesn't say that error has to be persistent and that subsequent
calls must also report the error. It also says:
FWIW my reading is the same as Tom's. It says "all data for the open
file descriptor" without qualification or special treatment after
errors. Not "some".
I'm not seeking to defend what the kernel seems to be doing. Rather, saying
that we might see similar behaviour on other platforms, crazy or not. I
haven't looked past linux yet, though.
I see no reason to think that any other operating system would behave
that way without strong evidence... This is openly acknowledged to be
"a mess" and "a surprise" in the Filesystem Summit article. I am not
really qualified to comment, but from a cursory glance at FreeBSD's
vfs_bio.c I think it's doing what you'd hope for... see the code near
the comment "Failed write, redirty."
--
Thomas Munro
http://www.enterprisedb.com
On 29 March 2018 at 20:07, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 28 March 2018 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
TL;DR: Pg should PANIC on fsync() EIO return.
Surely you jest.
No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
well to avoid similar lost-page-write issues.
I found your discussion with kernel hacker Jeff Layton at
https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
writeup seems to want a scheme where pages stay dirty after a
writeback failure so that we can try to fsync them again. Note that
that has never been the case in Linux after hard writeback failures,
AFAIK, so programs should definitely not assume that behavior."
The article above that says the same thing a couple of different ways,
ie that writeback failure leaves you with pages that are neither
written to disk successfully nor marked dirty.
If I'm reading various articles correctly, the situation was even
worse before his errseq_t stuff landed. That fixed cases of
completely unreported writeback failures due to sharing of PG_error
for both writeback and read errors with certain filesystems, but it
doesn't address the clean pages problem.
Yeah, I see why you want to PANIC.
In more ways than one ;)
I'm not seeking to defend what the kernel seems to be doing. Rather, saying
that we might see similar behaviour on other platforms, crazy or not. I
haven't looked past linux yet, though.
I see no reason to think that any other operating system would behave
that way without strong evidence... This is openly acknowledged to be
"a mess" and "a surprise" in the Filesystem Summit article. I am not
really qualified to comment, but from a cursory glance at FreeBSD's
vfs_bio.c I think it's doing what you'd hope for... see the code near
the comment "Failed write, redirty."
Ok, that's reassuring, but doesn't help us on the platform the great
majority of users deploy on :(
"If on Linux, PANIC"
Hrm.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 29, 2018 at 2:07 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I found your discussion with kernel hacker Jeff Layton at
https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
writeup seems to want a scheme where pages stay dirty after a
writeback failure so that we can try to fsync them again. Note that
that has never been the case in Linux after hard writeback failures,
AFAIK, so programs should definitely not assume that behavior."
And a bit below in the same comments, to this question about PG: "So,
what are the options at this point? The assumption was that we can
repeat the fsync (which as you point out is not the case), or shut
down the database and perform recovery from WAL", the same Jeff Layton
seems to agree PANIC is the appropriate response:
"Replaying the WAL synchronously sounds like the simplest approach
when you get an error on fsync. These are uncommon occurrences for the
most part, so having to fall back to slow, synchronous error recovery
modes when this occurs is probably what you want to do.".
And right after, he confirms the errseq_t patches are about always
detecting this, not more:
"The main thing I working on is to better guarantee is that you
actually get an error when this occurs rather than silently corrupting
your data. The circumstances where that can occur require some
corner-cases, but I think we need to make sure that it doesn't occur."
Jeff's comments in the pull request that merged errseq_t are worth
reading as well:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750
The article above that says the same thing a couple of different ways,
ie that writeback failure leaves you with pages that are neither
written to disk successfully nor marked dirty.
If I'm reading various articles correctly, the situation was even
worse before his errseq_t stuff landed. That fixed cases of
completely unreported writeback failures due to sharing of PG_error
for both writeback and read errors with certain filesystems, but it
doesn't address the clean pages problem.
Indeed, that's exactly how I read it as well (opinion formed
independently before reading your sentence above). The errseq_t
patches landed in v4.13 by the way, so very recently.
Yeah, I see why you want to PANIC.
Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.
On Fri, Mar 30, 2018 at 5:20 AM, Catalin Iacob <iacobcatalin@gmail.com> wrote:
Jeff's comments in the pull request that merged errseq_t are worth
reading as well:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750
Wow. It looks like there may be a separate question of when each
filesystem adopted this new infrastructure?
Yeah, I see why you want to PANIC.
Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.
The pre-errseq_t problems are beyond our control. There's nothing we
can do about that in userspace (except perhaps abandon OS-buffered IO,
a big project). We just need to be aware that this problem exists in
certain kernel versions and be grateful to Layton for fixing it.
The dropped dirty flag problem is something we can and in my view
should do something about, whatever we might think about that design
choice. As Andrew Gierth pointed out to me in an off-list chat about
this, by the time you've reached this state, both PostgreSQL's buffer
and the kernel's buffer are clean and might be reused for another
block at any time, so your data might be gone from the known universe
-- we don't even have the option to rewrite our buffers in general.
Recovery is the only option.
Thank you to Craig for chasing this down and +1 for his proposal, on Linux only.
--
Thomas Munro
http://www.enterprisedb.com
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
Yeah, I see why you want to PANIC.
Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.
There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy whatsoever. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap()
the file in question, obtain the PFNs (via /proc/pid/pagemap)
and enumerate those to match the ones with the error flag switched
on (via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also further use this info to logically
map the pages that failed IO back to the original file offsets,
and potentially retry IO just for those file ranges that cover
the failed pages. Just an idea, not tested.
Best regards,
Anthony
On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop@altatus.com> wrote:
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
Yeah, I see why you want to PANIC.
Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.
There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy whatsoever. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap()
the file in question, obtain the PFNs (via /proc/pid/pagemap)
and enumerate those to match the ones with the error flag switched
on (via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also further use this info to logically
map the pages that failed IO back to the original file offsets,
and potentially retry IO just for those file ranges that cover
the failed pages. Just an idea, not tested.
That sounds like a huge amount of complexity, with uncertainty as to how
it'll behave kernel-to-kernel, for negligible benefit.
I was exploring the idea of doing selective recovery of one relfilenode,
based on the assumption that we know the filenode related to the fd that
failed to fsync(). We could redo only WAL on that relation. But it fails
the same test: it's too complex for a niche case that shouldn't happen in
the first place, so it'll probably have bugs, or grow bugs in bitrot over
time.
Remember, if you're on ext4 with errors=remount-ro, you get shut down even
harder than a PANIC. So we should just use the big hammer here.
I'll send a patch this week.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer <craig@2ndquadrant.com> writes:
So we should just use the big hammer here.
And bitch, loudly and publicly, about how broken this kernel behavior is.
If we make enough of a stink maybe it'll get fixed.
regards, tom lane
On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
So we should just use the big hammer here.
And bitch, loudly and publicly, about how broken this kernel behavior is.
If we make enough of a stink maybe it'll get fixed.
That won't fix anything released already, so as per the information
gathered something has to be done anyway. The discussion of this thread
is spreading quite a lot actually.
Handling things at a low-level looks like a better plan for the backend.
Tools like pg_basebackup and pg_dump also issue fsync's on the data
created, we should do an equivalent for them, with some exit() calls in
file_utils.c. As of now failures are logged to stderr but not
considered fatal.
--
Michael
On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:
On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop@altatus.com> wrote:
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
Yeah, I see why you want to PANIC.
Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.
There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy whatsoever. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap()
the file in question, obtain the PFNs (via /proc/pid/pagemap)
and enumerate those to match the ones with the error flag switched
on (via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also further use this info to logically
map the pages that failed IO back to the original file offsets,
and potentially retry IO just for those file ranges that cover
the failed pages. Just an idea, not tested.
That sounds like a huge amount of complexity, with uncertainty as to how
it'll behave kernel-to-kernel, for negligible benefit.
Those interfaces have been around since the kernel 2.6 times and are
rather stable, but I was merely responding to your original post comment
regarding having a way of finding out which page(s) failed. I assume
that indeed there would be no benefit, especially since those errors
are usually not transient (typically they come from hard medium faults),
and although a filesystem could theoretically mask the error by allocating
a different logical block, I am not aware of any implementation that
currently does that.
I was exploring the idea of doing selective recovery of one relfilenode,
based on the assumption that we know the filenode related to the fd that
failed to fsync(). We could redo only WAL on that relation. But it fails
the same test: it's too complex for a niche case that shouldn't happen in
the first place, so it'll probably have bugs, or grow bugs in bitrot over
time.
Fully agree, those cases should be sufficiently rare that a complex
and possibly non-maintainable solution is not really warranted.
Remember, if you're on ext4 with errors=remount-ro, you get shut down even
harder than a PANIC. So we should just use the big hammer here.
I am not entirely sure what you mean here, does Pg really treat write()
errors as fatal? Also, the kind of errors that ext4 detects with this
option is at the superblock level and govern metadata rather than actual
data writes (recall that those are buffered anyway, no actual device IO
has to take place at the time of write()).
Best regards,
Anthony
On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
So we should just use the big hammer here.
And bitch, loudly and publicly, about how broken this kernel behavior is.
If we make enough of a stink maybe it'll get fixed.
It is not likely to be fixed (beyond what has been done already with the
manpage patches and errseq_t fixes on the reporting level). The issue is,
the kernel needs to deal with hard IO errors at that level somehow, and
since those errors typically persist, re-dirtying the pages would not
really solve the problem (unless some filesystem remaps the request to a
different block, assuming the device is alive). Keeping around dirty
pages that cannot possibly be written out is essentially a memory leak,
as those pages would stay around even after the application has exited.
Best regards,
Anthony