Postgres, fsync, and OSs (specifically linux)
Hi,
I thought I'd send this separately from the thread at
https://archives.postgresql.org/message-id/CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com as the issue has become more
general than what was mentioned in that thread, and it went off into
various weeds.
I went to LSF/MM 2018 to discuss that issue and related ones. Overall I'd say
it was a very productive discussion. I'll first try to recap the
current situation, updated with knowledge I gained. Secondly I'll try to
discuss the kernel changes that seem to have been agreed upon. Thirdly
I'll try to sum up what postgres needs to change.
== Current Situation ==
The fundamental problem is that postgres assumed that any IO error would
be reported at fsync time, and that the error would be reported until
resolved. That's not true in several operating systems, linux included.
There are various judgement calls leading to the current OS (specifically
linux, but the concerns are similar in other OSs) behaviour:
- By the time IO errors are treated as fatal, it's unlikely that plain
retries attempting to write exactly the same data are going to
succeed. There are retries on several layers. Some cases would be
resolved by overwriting a larger amount (so device level remapping
functionality can mask dead areas), but plain retries aren't going to
get there if they didn't the first time round.
- Retaining all the data necessary for retries would make it quite
possible to turn IO errors on some device into out of memory
errors. This is true to a far lesser degree if only enough information
were to be retained to (re-)report an error, rather than actually
retry the write.
- Continuing to re-report an error after one fsync() failed would make
it hard to recover from that fact. There'd need to be a way to "clear"
a persistent error bit, and that'd obviously be outside of posix.
- Some other databases use direct-IO and thus these paths haven't been
exercised under fire that much.
- Actually marking files as persistently failed would require filesystem
changes, and filesystem metadata IO, far from guaranteed in failure
scenarios.
Before linux v4.13 errors in kernel writeback would be reported at most
once, without a guarantee that that'd happen (IIUC memory pressure could
lead to the relevant information being evicted) - but it was pretty
likely. After v4.13 (see https://lwn.net/Articles/724307/) errors are
reported exactly once to all open file descriptors for a file with an
error - but never for files that have been opened after the error
occurred.
It's worth noting that on linux it's not well defined what contents one
would read after a writeback error. IIUC xfs will mark the pagecache
contents that triggered an error as invalid, triggering a re-read from
the underlying storage (thus either failing or returning old but
persistent contents). Whereas some other filesystems (among them ext4 I
believe) retain the modified contents of the page cache, but mark it
as clean (thereby returning new contents until the page cache contents
are evicted).
Some filesystems (prominently NFS in many configurations) perform an
implicit fsync when closing the file. While postgres checks for an error
of close() and reports it, we don't treat it as fatal. It's worth
noting that by my reading this means that an fsync error at close() will
*not* be re-reported by the time an explicit fsync() is issued. It also
means that we'll not react properly to the possible ENOSPC errors that
may be reported at close() for NFS. At least the latter isn't just the
case in linux.
Proposals for how postgres could deal with this included using syncfs(2)
- but that turns out not to work at all currently, because syncfs()
basically wouldn't return any file-level errors. It'd also imply
superfluously flushing temporary files etc.
The second major type of proposal was using direct-IO. That'd generally
be a desirable feature, but a) would require some significant changes to
postgres to be performant, b) isn't really applicable for the large
percentage of installations that aren't tuned reasonably well, because
at the moment the OS page cache functions as a memory-pressure aware
extension of postgres' page cache.
Another topic brought up in this thread was the handling of ENOSPC
errors that aren't triggered on a filesystem level, but rather are
triggered by thin provisioning. On linux that currently apparently leads
to page cache contents being lost (and errors "eaten") in a lot of
places, including just when doing a write(). In a lot of cases it's
pretty much expected that the file system will just hang or react
unpredictably upon space exhaustion. My reading is that the block-layer
thin provisioning code is still pretty fresh, and should only be used
with great care. The only way to halfway reliably use it appears to be
to change the configuration so space exhaustion blocks until admin
intervention (at least dm-thinp allows that).
There's some clear need to automate some more testing in this area so
that future behaviour changes don't surprise us.
== Proposed Linux Changes ==
- Matthew Wilcox proposed (and posted a patch) that'd partially revert
behaviour to the pre v4.13 world, by *also* reporting errors to
"newer" file-descriptors if the error hasn't previously been
reported. That'd still not guarantee that the error is reported
(memory pressure could evict information without open fd), but in most
situations we'll again get the error in the checkpointer.
This seems to be largely agreed upon. It's unclear whether it'll go into
the stable backports for still-maintained >= v4.13 kernels.
- syncfs() will be fixed so it reports errors properly - that'll likely
require passing it an O_PATH filedescriptor to have space to store the
errseq_t value that allows discerning already reported and new errors.
No patch has appeared yet, but the behaviour seems largely agreed
upon.
- Make per-filesystem error counts available in a uniform (i.e. same for
every supporting fs) manner. Right now it's very hard to figure out
whether errors occurred. There seemed general agreement that exporting
knowledge about such errors is desirable. Quite possibly the syncfs()
fix above will provide the necessary infrastructure. It's unclear as
of yet how the value would be exposed. Per-fs /sys/ entries and an
ioctl on O_PATH fds have been mentioned.
These error counts would not vanish due to memory pressure, and they
can be checked even without knowing which files in a specific
filesystem have been touched (e.g. when just untar-ing something).
There seemed to be fairly widespread agreement that this'd be a good
idea. It's much less clear whether somebody would do the work.
- Provide config knobs that allow defining the FS error behaviour in a
consistent way across supported filesystems. XFS currently has various
knobs controlling what happens in case of metadata errors [1] (retry
forever, timeout, return up). It was proposed that this interface be
extended to also deal with data errors, and moved into generic support
code.
While the timeline is unclear, there seemed to be widespread support
for the idea. I believe Dave Chinner indicated that he at least has
plans to generalize the code.
- Stop inodes with unreported errors from being evicted. This will
guarantee that a later fsync (without an open FD) will see the
error. The memory pressure concerns here are lower than with keeping
all the failed pages in memory, and it could be optimized further.
I read some tentative agreement behind this idea, but I think it's by
far the most controversial one.
== Potential Postgres Changes ==
Several operating systems / file systems behave differently (See
e.g. [2], thanks Thomas) than we expected. Even the discussed changes to
e.g. linux don't get us to where we thought we were. There's obviously also
the question of how to deal with kernels / OSs that have not been
updated.
Changes that appear to be necessary, even for kernels with the issues
addressed:
- Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
retry recovery. While ENODEV (underlying device went away) will be
persistent, it probably makes sense to treat it the same or even just
give up and shut down. One question I see here is whether we just
want to continue crash-recovery cycles, or whether we want to limit
that.
- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.
Robert, on IM, wondered whether there'd be a race between some backend
doing a close(), triggering a PANIC, and a checkpoint succeeding. I
don't *think* so, because the error will only happen if there's
outstanding dirty data, and the checkpoint would have flushed that out
if it belonged to the current checkpointing cycle.
- The outstanding fsync request queue isn't persisted properly [3]. This
means that even if the kernel behaved the way we'd expected, we'd not
fail a second checkpoint :(. It's possible that we don't need to deal
with this because we'll henceforth PANIC, but I'd argue we should fix
that regardless. Seems like a time-bomb otherwise (e.g. after moving
to DIO somebody might want to relax the PANIC...).
- It might be a good idea to whitelist expected return codes for write()
and PANIC on ones that we did not expect. E.g. when hitting an EIO we
should probably PANIC, to get back to a known good state. Even though
it's likely that we'd see that error again at fsync().
- Docs.
I think we also need to audit a few codepaths. I'd be surprised if we
PANICed appropriately on all fsyncs(), particularly around the SLRUs. I
think we need to be particularly careful around the WAL handling, as I
think it's fairly likely that there are cases where we'd write out WAL in
one backend and then fsync() in another backend with a file descriptor
that has only been opened *after* the write occurred, which means we
might miss the error entirely.
Then there's the question of how we want to deal with kernels that
haven't been updated with the aforementioned changes. We could say that
we expect decent OS support and declare that we just can't handle this -
given that at least various linux versions, netbsd, openbsd, MacOS just
silently drop errors and we'd need different approaches for dealing with
that, that doesn't seem like an insane approach.
What we could do:
- forward file descriptors from backends to checkpointer (using
SCM_RIGHTS) when marking a segment dirty. That'd require some
optimizations (see [4]) to avoid doing so repeatedly. That'd
guarantee correct behaviour in all linux kernels >= 4.13 (possibly
backported by distributions?), and I think it'd also make it vastly
more likely that errors are reported in earlier kernels.
This should be doable without a noticeable performance impact, I
believe. I don't think it'd be that hard either, but it'd be a bit of
a pain to backport it to all postgres versions, as well as a bit
invasive for that.
The infrastructure this'd likely end up building (hashtable of open
relfilenodes), would likely be useful for further things (like caching
file size).
- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.
While I'm a bit concerned about adding user-code before a checkpoint, if
we'd do it as a shell command it seems pretty reasonable. And useful
even without concern for the fsync issue itself. Checking for IO
errors could e.g. also include checking for read errors - it'd not be
unreasonable to not want to complete a checkpoint if there'd been any
media errors.
- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows forcing DIO to be enabled for WAL
even if replication is turned on.
- magic
Greetings,
Andres Freund
[1]:
static const struct xfs_error_init xfs_error_meta_init[XFS_ERR_ERRNO_MAX] = {
{ .name = "default",
.max_retries = XFS_ERR_RETRY_FOREVER,
.retry_timeout = XFS_ERR_RETRY_FOREVER,
},
{ .name = "EIO",
.max_retries = XFS_ERR_RETRY_FOREVER,
.retry_timeout = XFS_ERR_RETRY_FOREVER,
},
{ .name = "ENOSPC",
.max_retries = XFS_ERR_RETRY_FOREVER,
.retry_timeout = XFS_ERR_RETRY_FOREVER,
},
{ .name = "ENODEV",
.max_retries = 0, /* We can't recover from devices disappearing */
.retry_timeout = 0,
},
};
[2]: https://wiki.postgresql.org/wiki/Fsync_Errors
[3]: https://archives.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk
[4]: https://archives.postgresql.org/message-id/20180424180054.inih6bxfspgowjuc@alap3.anarazel.de
On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:
- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.
If the no-space or write failures are persistent, as you mentioned
above, what is the point of going into crash recovery --- why not just
shut down? Also, since we can't guarantee that we can write any
persistent state to storage, we have no way of preventing infinite crash
recovery loops, which, based on inconsistent writes, might make things
worse. I think a single panic with no restart is the right solution.
An additional feature we have talked about is running some kind of
notification shell script to inform administrators, similar to
archive_command. We need this too when sync replication fails.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Hi,
On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote:
On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:
- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

If the no-space or write failures are persistent, as you mentioned
above, what is the point of going into crash recovery --- why not just
shut down?
Well, I mentioned that as an alternative in my email. But for one we
don't really have cases where we do that right now, for another we can't
really differentiate between a transient and non-transient state. It's
entirely possible that the admin on the system that ran out of space
fixes things, clearing up the problem.
Also, since we can't guarantee that we can write any persistent state
to storage, we have no way of preventing infinite crash recovery
loops, which, based on inconsistent writes, might make things worse.
How would it make things worse?
An additional features we have talked about is running some kind of
notification shell script to inform administrators, similar to
archive_command. We need this too when sync replication fails.
To me that seems like a feature independent of this thread.
Greetings,
Andres Freund
On Fri, Apr 27, 2018 at 04:10:43PM -0700, Andres Freund wrote:
Hi,
On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote:
On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:
- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.If the no-space or write failures are persistent, as you mentioned
above, what is the point of going into crash recovery --- why not just
shut down?

Well, I mentioned that as an alternative in my email. But for one we
don't really have cases where we do that right now, for another we can't
really differentiate between a transient and non-transient state. It's
entirely possible that the admin on the system that ran out of space
fixes things, clearing up the problem.
True, but if we get a no-space error, odds are it will not be fixed at
the time we are failing. Wouldn't the administrator check that the
server is still running after they free the space?
Also, since we can't guarantee that we can write any persistent state
to storage, we have no way of preventing infinite crash recovery
loops, which, based on inconsistent writes, might make things worse.

How would it make things worse?
Uh, I can imagine some writes working and some not, and getting things
more inconsistent. I would say at least that we don't know.
An additional features we have talked about is running some kind of
notification shell script to inform administrators, similar to
archive_command. We need this too when sync replication fails.

To me that seems like a feature independent of this thread.
Well, if we are introducing new panic-and-not-restart behavior, we might
need this new feature.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On 2018-04-27 19:38:30 -0400, Bruce Momjian wrote:
On Fri, Apr 27, 2018 at 04:10:43PM -0700, Andres Freund wrote:
Hi,
On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote:
On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:
- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

If the no-space or write failures are persistent, as you mentioned
above, what is the point of going into crash recovery --- why not just
shut down?

Well, I mentioned that as an alternative in my email. But for one we
don't really have cases where we do that right now, for another we can't
really differentiate between a transient and non-transient state. It's
entirely possible that the admin on the system that ran out of space
fixes things, clearing up the problem.

True, but if we get a no-space error, odds are it will not be fixed at
the time we are failing. Wouldn't the administrator check that the
server is still running after they free the space?
I'd assume it's pretty common that those are separate teams. Given that
we currently don't behave that way for other cases where we *already*
can enter crash-recovery loops I don't think we need to introduce that
here. It's far more common to enter this kind of problem with pg_xlog
filling up the ordinary way. And that can lead to such loops.
Also, since we can't guarantee that we can write any persistent state
to storage, we have no way of preventing infinite crash recovery
loops, which, based on inconsistent writes, might make things worse.

How would it make things worse?
Uh, I can imagine some writes working and some not, and getting things
more inconsistent. I would say at least that we don't know.
Recovery needs to fix that or we're lost anyway. And we'll retry exactly
the same writes each round.
An additional features we have talked about is running some kind of
notification shell script to inform administrators, similar to
archive_command. We need this too when sync replication fails.

To me that seems like a feature independent of this thread.
Well, if we are introducing new panic-and-not-restart behavior, we might
need this new feature.
I don't see how this follows. It's easier to externally script
notification for the server having died, than doing it for crash
restarts. That's why we have restart_after_crash=false... There might
be some arguments for this type of notification, but I don't think it
should be conflated with the problem here. Nor is it guaranteed that
such a script could do much, given that disks might be failing and such.
Greetings,
Andres Freund
On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote:
Hi,
I thought I'd send this separately from [0] as the issue has become more
general than what was mentioned in that thread, and it went off into
various weeds.
Thanks very much for going and for the great summary.
- Actually marking files as persistently failed would require filesystem
changes, and filesystem metadata IO, far from guaranteed in failure
scenarios.
Yeah, I've avoided suggesting anything like that because it seems way
too likely to lead to knock-on errors.
Like malloc'ing in an OOM path, just don't.
The second major type of proposal was using direct-IO. That'd generally
be a desirable feature, but a) would require some significant changes to
postgres to be performant, b) isn't really applicable for the large
percentage of installations that aren't tuned reasonably well, because
at the moment the OS page cache functions as a memory-pressure aware
extension of postgres' page cache.
Yeah. I've avoided advocating for O_DIRECT because it's a big job
(understatement). We'd need to pay so much more attention to details
of storage layout if we couldn't rely as much on the kernel neatly
organising and queuing everything for us, too.
At the risk of displaying my relative ignorance of direct I/O: Does
O_DIRECT without O_SYNC even provide a strong guarantee that when you
close() the file, all I/O has reliably succeeded? It must've gone
through the FS layer, but don't FSes do various caching and
reorganisation too? Can the same issue arise in other ways unless we
also fsync() before close() or write O_SYNC?
At one point I looked into using AIO instead. But last I looked it was
pretty spectacularly quirky when it comes to reliably flushing, and
outright broken on some versions. In any case, our multiprocessing
model would make tracking completions annoying, likely more so than
the sort of FD handoff games we've discussed.
Another topic brought up in this thread was the handling of ENOSPC
errors that aren't triggered on a filesystem level, but rather are
triggered by thin provisioning. On linux that currently apparently leads
to page cache contents being lost (and errors "eaten") in a lot of
places, including just when doing a write().
... wow.
Is that with lvm-thin?
The thin provisioning I was mainly concerned with is SAN-based thin
provisioning, which looks like a normal iSCSI target or a normal LUN
on a HBA to Linux. Then it starts failing writes with a weird
potentially vendor-specific sense error if it runs out of backing
store. How that's handled likely depends on the specific error, the
driver, which FS you use, etc. In the case I saw, multipath+lvm+xfs,
it resulted in lost writes and fsync() errors being reported once, per
the start of the original thread.
In a lot of cases it's
pretty much expected that the file system will just hang or react
unpredictably upon space exhaustion. My reading is that the block-layer
thin provisioning code is still pretty fresh, and should only be used
with great care. The only way to halfway reliably use it appears to
change the configuration so space exhaustion blocks until admin
intervention (at least dm-thinp allows that).
Seems that should go in the OS-specific configuration part of the
docs, along with the advice I gave on the original thread re
configuring multipath no_path_retries.
There's some clear need to automate some more testing in this area so
that future behaviour changes don't surprise us.
We don't routinely test ENOSPC (or memory exhaustion, or crashes) in
PostgreSQL even on bog standard setups.
Like the performance farm discussion, this is something I'd like to
pick up at some point. I'm going to need to talk to the team I work
with regarding time/resources allocation, but I think it's important
that we make such testing more of a routine thing.
- Matthew Wilcox proposed (and posted a patch) that'd partially revert
behaviour to the pre v4.13 world, by *also* reporting errors to
"newer" file-descriptors if the error hasn't previously been
reported. That'd still not guarantee that the error is reported
(memory pressure could evict information without open fd), but in most
situations we'll again get the error in the checkpointer.

This seems to be largely agreed upon. It's unclear whether it'll go into
the stable backports for still-maintained >= v4.13 kernels.
That seems very sensible. In our case we're very unlikely to have some
other unrelated process come in and fsync() our files for us.
I'd want to be sure the report didn't get eaten by sync() or syncfs() though.
- syncfs() will be fixed so it reports errors properly - that'll likely
require passing it an O_PATH filedescriptor to have space to store the
errseq_t value that allows discerning already reported and new errors.

No patch has appeared yet, but the behaviour seems largely agreed
upon.
Good, but as you noted, of limited use to us unless we want to force
users to manage space for temporary and unlogged relations completely
separately.
I wonder if we could convince the kernel to offer a file_sync_mode
xattr to control this? (Hint: I'm already running away in a mylar fire
suit).
- Make per-filesystem error counts available in a uniform (i.e. same for
every supporting fs) manner. Right now it's very hard to figure out
whether errors occurred. There seemed general agreement that exporting
knowledge about such errors is desirable. Quite possibly the syncfs()
fix above will provide the necessary infrastructure. It's unclear as
of yet how the value would be exposed. Per-fs /sys/ entries and an
ioctl on O_PATH fds have been mentioned.

These error counts would not vanish due to memory pressure, and they
can be checked even without knowing which files in a specific
filesystem have been touched (e.g. when just untar-ing something).

There seemed to be fairly widespread agreement that this'd be a good
idea. It's much less clear whether somebody would do the work.

- Provide config knobs that allow defining the FS error behaviour in a
consistent way across supported filesystems. XFS currently has various
knobs controlling what happens in case of metadata errors [1] (retry
forever, timeout, return up). It was proposed that this interface be
extended to also deal with data errors, and moved into generic support
code.

While the timeline is unclear, there seemed to be widespread support
for the idea. I believe Dave Chinner indicated that he at least has
plans to generalize the code.
That's great. It sounds like this has revitalised some interest in the
error reporting and might yield some more general cleanups :)
- Stop inodes with unreported errors from being evicted. This will
guarantee that a later fsync (without an open FD) will see the
error. The memory pressure concerns here are lower than with keeping
all the failed pages in memory, and it could be optimized further.

I read some tentative agreement behind this idea, but I think it's by
far the most controversial one.
The main issue there would seem to be cases of whole-FS failure like
the USB-key-yank example. You're going to have to be able to get rid
of them at some point.
- Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
retry recovery. While ENODEV (underlying device went away) will be
persistent, it probably makes sense to treat it the same or even just
give up and shut down. One question I see here is whether we just
want to continue crash-recovery cycles, or whether we want to limit
that.
Right now, we'll panic once, then panic again in redo if the error
persists and give up.
On some systems, and everywhere that Pg is directly user-managed with
pg_ctl, that'll leave Pg down until the operator intervenes. Some init
systems will restart the postmaster automatically. Some will give up
after a few tries. Some will back off retries over time. It depends on
the init system. I'm not sure that's a great outcome.
So rather than giving up if redo fails, we might want to offer a knob
to retry, possibly with pause/backoff. I'm sure people currently
expect PostgreSQL to try to stay up and recover, like it does after a
segfault or most other errors.
Personally I prefer to run Pg with restart_after_crash=off and let the
init system launch a new postmaster, but that's not an option unless
you have a sensible init.
- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

Robert, on IM, wondered whether there'd be a race between some backend
doing a close(), triggering a PANIC, and a checkpoint succeeding. I
don't *think* so, because the error will only happen if there's
outstanding dirty data, and the checkpoint would have flushed that out
if it belonged to the current checkpointing cycle.
Even if it's possible (which it sounds like it probably isn't), it
might also be one of those corner-cases-of-corner-cases where we just
shrug and worry about bigger fish.
- The outstanding fsync request queue isn't persisted properly [3]. This
means that even if the kernel behaved the way we'd expected, we'd not
fail a second checkpoint :(. It's possible that we don't need to deal
with this because we'll henceforth PANIC, but I'd argue we should fix
that regardless. Seems like a time-bomb otherwise (e.g. after moving
to DIO somebody might want to relax the PANIC...).
Huh! Good find. That definitely merits fixing.
- It might be a good idea to whitelist expected return codes for write()
and PANIC on ones that we did not expect. E.g. when hitting an EIO we
should probably PANIC, to get back to a known good state. Even though
it's likely that we'd see that error again at fsync().

- Docs.
Yep. Especially OS-specific configuration for known dangerous setups
(lvm-thin, multipath), etc. I imagine we can distill a lot of it from
the discussion and simplify a bit.
I think we also need to audit a few codepaths. I'd be surprised if we
PANICed appropriately on all fsyncs(), particularly around the SLRUs.
We _definitely_ do not, see the patch I sent on the other thread.
Then there's the question of how we want to deal with kernels that
haven't been updated with the aforementioned changes. We could say that
we expect decent OS support and declare that we just can't handle this -
given that at least various linux versions, netbsd, openbsd, MacOS just
silently drop errors and we'd need different approaches for dealing with
that, that doesn't seem like an insane approach.
What we could do:
- forward file descriptors from backends to checkpointer (using
SCM_RIGHTS) when marking a segment dirty. That'd require some
optimizations (see [4]) to avoid doing so repeatedly. That'd
guarantee correct behaviour in all linux kernels >= 4.13 (possibly
backported by distributions?), and I think it'd also make it vastly
more likely that errors are reported in earlier kernels.
It'd be interesting to see if other platforms that support fd passing
will give us the desired behaviour too. But even if it only helps on
Linux, that's a huge majority of the PostgreSQL deployments these
days.
- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.
I looked into using trace event file descriptors for this, btw, but
we'd need CAP_SYS_ADMIN to create one that captured events for other
processes. Plus filtering the events to find only events for the files
/ file systems of interest would be far from trivial. And I don't know
what guarantees we have about when events are delivered.
I'd love to be able to use inotify for this, but again, that'd only be
a new-kernels thing since it'd need an inotify extension to report I/O
errors.
Presumably mostly this check would land up looking at dmesg.
I'm not convinced it'd get widely deployed and widely used, or that
it'd be used correctly when people tried to use it. Look at the
hideous mess that most backup/standby creation scripts,
archive_command scripts, etc are.
- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.
Seems like a long term goal, but you've noted elsewhere that doing it
well would be hard. I suspect we'd need writer threads, we'd need to
know more about the underlying FS/storage layout to make better
decisions about write parallelism, etc. We get away with a lot right
now by letting the kernel and buffered I/O sort that out.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Greetings,
* Andres Freund (andres@anarazel.de) wrote:
On 2018-04-27 19:38:30 -0400, Bruce Momjian wrote:
On Fri, Apr 27, 2018 at 04:10:43PM -0700, Andres Freund wrote:
On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote:
On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:
- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.
If the no-space or write failures are persistent, as you mentioned
above, what is the point of going into crash recovery --- why not just
shut down?
Well, I mentioned that as an alternative in my email. But for one we
don't really have cases where we do that right now, for another we can't
really differentiate between a transient and non-transient state. It's
entirely possible that the admin on the system that ran out of space
fixes things, clearing up the problem.
True, but if we get a no-space error, odds are it will not be fixed at
the time we are failing. Wouldn't the administrator check that the
server is still running after they free the space?
I'd assume it's pretty common that those are separate teams. Given that
we currently don't behave that way for other cases where we *already*
can enter crash-recovery loops I don't think we need to introduce that
here. It's far more common to enter this kind of problem with pg_xlog
filling up the ordinary way. And that can lead to such loops.
When we crash-restart, we also go through and clean things up some, no?
Seems like that gives us the potential to end up fixing things ourselves
and allowing the crash-restart to succeed.
Consider unlogged tables, temporary tables, on-disk sorts, etc. It's
entirely common for a bad query to run the system out of disk space (but
have a write of a regular table be what discovers the out-of-space
problem...) and if we crash-restart properly then we'd hopefully clean
things out, freeing up space, and allowing us to come back up.
Now, of course, ideally admins would set up temp tablespaces and
segregate WAL onto its own filesystem, etc, but...
Thanks!
Stephen
Greetings,
* Craig Ringer (craig@2ndquadrant.com) wrote:
On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote:
- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.
I looked into using trace event file descriptors for this, btw, but
we'd need CAP_SYS_ADMIN to create one that captured events for other
processes. Plus filtering the events to find only events for the files
/ file systems of interest would be far from trivial. And I don't know
what guarantees we have about when events are delivered.
I'd love to be able to use inotify for this, but again, that'd only be
a new-kernels thing since it'd need an inotify extension to report I/O
errors.
Presumably mostly this check would land up looking at dmesg.
I'm not convinced it'd get widely deployed and widely used, or that
it'd be used correctly when people tried to use it. Look at the
hideous mess that most backup/standby creation scripts,
archive_command scripts, etc are.
Agree with more-or-less everything you've said here, but a big +1 on
this. If we do end up going down this route we have *got* to provide
scripts which we know work and have been tested and are well maintained
on the popular OS's for the popular filesystems and make it clear that
we've tested those and not others. We definitely shouldn't put
something in our docs that is effectively an example of the interface
but not an actual command that anyone should be using.
Thanks!
Stephen
On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote:
- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.
While I'm a bit concerned about adding user-code before a checkpoint, if
we'd do it as a shell command it seems pretty reasonable. And useful
even without concern for the fsync issue itself. Checking for IO
errors could e.g. also include checking for read errors - it'd not be
unreasonable to not want to complete a checkpoint if there'd been any
media errors.
It seems clear that we need to evaluate our compatibility not just
with an OS, as we do now, but with an OS/filesystem.
Although people have suggested some approaches, I'm more interested in
discovering how we can be certain we got it right.
And the end result seems to be that PostgreSQL will be forced, in the
short term, to declare certain combinations of OS/filesystem
unsupported, with clear warning sent out to users.
Adding a pre-checkpoint hook encourages people to fix this themselves
without reporting issues, so I initially oppose this until we have a
clearer argument as to why we need it. The answer is not to make this
issue more obscure, but to make it more public.
- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.
"Use DirectIO" is roughly the same suggestion as "don't trust Linux filesystems".
It would be a major admission of defeat for us to take that as our
main route to a solution.
The people I've spoken to so far have encouraged us to continue
working with the filesystem layer, offering encouragement of our
decision to use filesystems.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On Sat, Apr 28, 2018 at 11:21:20AM -0400, Stephen Frost wrote:
* Craig Ringer (craig@2ndquadrant.com) wrote:
On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote:
- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.
I looked into using trace event file descriptors for this, btw, but
we'd need CAP_SYS_ADMIN to create one that captured events for other
processes. Plus filtering the events to find only events for the files
/ file systems of interest would be far from trivial. And I don't know
what guarantees we have about when events are delivered.
I'd love to be able to use inotify for this, but again, that'd only be
a new-kernels thing since it'd need an inotify extension to report I/O
errors.
Presumably mostly this check would land up looking at dmesg.
I'm not convinced it'd get widely deployed and widely used, or that
it'd be used correctly when people tried to use it. Look at the
hideous mess that most backup/standby creation scripts,
archive_command scripts, etc are.
Agree with more-or-less everything you've said here, but a big +1 on
this. If we do end up going down this route we have *got* to provide
scripts which we know work and have been tested and are well maintained
on the popular OS's for the popular filesystems and make it clear that
we've tested those and not others. We definitely shouldn't put
something in our docs that is effectively an example of the interface
but not an actual command that anyone should be using.
This dmesg-checking has been mentioned several times now, but IME
enterprise distributions (or server ops teams?) seem to tighten access
to dmesg and /var/log to non-root users, including postgres.
Well, or just vanilla Debian stable apparently:
postgres@fock:~$ dmesg
dmesg: read kernel buffer failed: Operation not permitted
Is it really a useful expectation that the postgres user will be able to
trawl system logs for I/O errors? Or are we expecting the sysadmins (in
case they are distinct from the DBAs) to setup sudo and/or relax
permissions for this everywhere? We should document this requirement
properly at least then.
The netlink thing from Google that Ted Ts'o mentioned would probably
work around that, but if that is opened up it would not be deployed
anytime soon either.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Hi,
On 2018-04-28 11:10:54 -0400, Stephen Frost wrote:
When we crash-restart, we also go through and clean things up some, no?
Seems like that gives us the potential to end up fixing things ourselves
and allowing the crash-restart to succeed.
Sure, there's the potential for that. But it's quite possible to be
missing a lot of free space over NFS (this really isn't much of an issue
for local FS, at least not on linux) in a workload with rapidly
expanding space usage. And even if you recover, you could just hit the
issue again shortly afterwards.
Greetings,
Andres Freund
Hi,
On 2018-04-28 17:35:48 +0200, Michael Banck wrote:
This dmesg-checking has been mentioned several times now, but IME
enterprise distributions (or server ops teams?) seem to tighten access
to dmesg and /var/log to non-root users, including postgres.
Well, or just vanilla Debian stable apparently:
postgres@fock:~$ dmesg
dmesg: read kernel buffer failed: Operation not permitted
Is it really a useful expectation that the postgres user will be able to
trawl system logs for I/O errors? Or are we expecting the sysadmins (in
case they are distinct from the DBAs) to setup sudo and/or relax
permissions for this everywhere? We should document this requirement
properly at least then.
I'm not a huge fan of this approach, but yes, that'd be necessary. It's
not that problematic to have to change /dev/kmsg permissions imo. Adding
a read group / acl seems quite doable.
The netlink thing from Google that Ted Ts'o mentioned would probably
work around that, but if that is opened up it would not be deployed
anytime soon either.
Yea, that seems irrelevant for now.
Greetings,
Andres Freund
Hi,
On 2018-04-28 08:25:53 -0700, Simon Riggs wrote:
- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.
"Use DirectIO" is roughly the same suggestion as "don't trust Linux filesystems".
I want to emphasize that this is NOT a linux only issue. It's a problem
across a number of operating systems, including linux.
It would be a major admission of defeat for us to take that as our
main route to a solution.
Well, I think we were wrong to not engineer towards DIO. There's just
too many issues with buffered IO to not have a supported path for
DIO. But given that it's unrealistic to do so without major work, and
wouldn't be applicable for all installations (shared_buffer size becomes
critical), I don't think it matters that much for the issue discussed
here.
The people I've spoken to so far have encouraged us to continue
working with the filesystem layer, offering encouragement of our
decision to use filesystems.
There's a lot of people disagreeing with it too.
Greetings,
Andres Freund
Hi,
On 2018-04-28 20:00:25 +0800, Craig Ringer wrote:
On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote:
The second major type of proposal was using direct-IO. That'd generally
be a desirable feature, but a) would require some significant changes to
postgres to be performant, b) isn't really applicable for the large
percentage of installations that aren't tuned reasonably well, because
at the moment the OS page cache functions as a memory-pressure aware
extension of postgres' page cache.
Yeah. I've avoided advocating for O_DIRECT because it's a big job
(understatement). We'd need to pay so much more attention to details
of storage layout if we couldn't rely as much on the kernel neatly
organising and queuing everything for us, too.
At the risk of displaying my relative ignorance of direct I/O: Does
O_DIRECT without O_SYNC even provide a strong guarantee that when you
close() the file, all I/O has reliably succeeded? It must've gone
through the FS layer, but don't FSes do various caching and
reorganisation too? Can the same issue arise in other ways unless we
also fsync() before close() or write O_SYNC?
No, not really. There's generally two categories of IO here: Metadata IO
and data IO. The filesystem's metadata IO a) has a lot more error
checking (including things like remount-ro, stalling the filesystem on
errors etc), b) isn't direct IO itself. For some filesystem metadata
operations you'll still need fsyncs, but the *data* is flushed if use
use DIO. I'd personally use O_DSYNC | O_DIRECT, and have the metadata
operations guaranteed by fsyncs. You'd need the current fsyncs for
renaming, and probably some fsyncs for file extensions. The latter to
make sure the filesystem has written the metadata change.
At one point I looked into using AIO instead. But last I looked it was
pretty spectacularly quirky when it comes to reliably flushing, and
outright broken on some versions. In any case, our multiprocessing
model would make tracking completions annoying, likely more so than
the sort of FD handoff games we've discussed.
AIO pretty much only works sensibly with DIO.
Another topic brought up in this thread was the handling of ENOSPC
errors that aren't triggered on a filesystem level, but rather are
triggered by thin provisioning. On linux that currently apparently leads
to page cache contents being lost (and errors "eaten") in a lot of
places, including just when doing a write().
... wow.
Is that with lvm-thin?
I think both dm and lvm (I typed llvm thrice) based thin
provisioning. The FS code basically didn't expect ENOSPC being returned
from storage, but suddenly the storage layer started returning it...
The thin provisioning I was mainly concerned with is SAN-based thin
provisioning, which looks like a normal iSCSI target or a normal LUN
on a HBA to Linux. Then it starts failing writes with a weird
potentially vendor-specific sense error if it runs out of backing
store. How that's handled likely depends on the specific error, the
driver, which FS you use, etc. In the case I saw, multipath+lvm+xfs,
it resulted in lost writes and fsync() errors being reported once, per
the start of the original thread.
I think the concerns are largely the same for that. You'll have to
configure the SAN to block in that case.
- Matthew Wilcox proposed (and posted a patch) that'd partially revert
behaviour to the pre v4.13 world, by *also* reporting errors to
"newer" file-descriptors if the error hasn't previously been
reported. That'd still not guarantee that the error is reported
(memory pressure could evict information without open fd), but in most
situations we'll again get the error in the checkpointer.
This seems largely to be agreed upon. It's unclear whether it'll go into
the stable backports for still-maintained >= v4.13 kernels.
That seems very sensible. In our case we're very unlikely to have some
other unrelated process come in and fsync() our files for us.
I'd want to be sure the report didn't get eaten by sync() or syncfs() though.
It doesn't. Basically every fd has an errseq_t value copied into it at
open.
- syncfs() will be fixed so it reports errors properly - that'll likely
require passing it an O_PATH filedescriptor to have space to store the
errseq_t value that allows discerning already reported and new errors.
No patch has appeared yet, but the behaviour seems largely agreed
upon.
Good, but as you noted, of limited use to us unless we want to force
users to manage space for temporary and unlogged relations completely
separately.
Well, I think it'd still be ok as a backstop if it had decent error
semantics. We don't checkpoint that often, and doing the syncing via
syncfs() is considerably more efficient than individual fsync()s. But
given it's currently buggy that tradeoff is moot.
I wonder if we could convince the kernel to offer a file_sync_mode
xattr to control this? (Hint: I'm already running away in a mylar fire
suit).
Err. I am fairly sure you're not going to get anywhere with that. Given
we're concerned about existing kernels, I doubt it'd help us much anyway.
- Stop inodes with unreported errors from being evicted. This will
guarantee that a later fsync (without an open FD) will see the
error. The memory pressure concerns here are lower than with keeping
all the failed pages in memory, and it could be optimized further.
I read some tentative agreement behind this idea, but I think it's the
by far most controversial one.
The main issue there would seem to be cases of whole-FS failure like
the USB-key-yank example. You're going to have to be able to get rid
of them at some point.
It's not actually a real problem (despite initially being brought up a
number of times by kernel people). There's a separate error for that
(ENODEV), and filesystems already handle it differently. Once that's
returned, fsyncs() etc are just shortcut to ENODEV.
What we could do:
- forward file descriptors from backends to checkpointer (using
SCM_RIGHTS) when marking a segment dirty. That'd require some
optimizations (see [4]) to avoid doing so repeatedly. That'd
guarantee correct behaviour in all linux kernels >= 4.13 (possibly
backported by distributions?), and I think it'd also make it vastly
more likely that errors are reported in earlier kernels.
It'd be interesting to see if other platforms that support fd passing
will give us the desired behaviour too. But even if it only helps on
Linux, that's a huge majority of the PostgreSQL deployments these
days.
Afaict it'd not help all of them. It does provide guarantees against the
inode being evicted on pretty much all OSs, but not all of them have an
error counter there...
- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.
Seems like a long term goal, but you've noted elsewhere that doing it
well would be hard. I suspect we'd need writer threads, we'd need to
know more about the underlying FS/storage layout to make better
decisions about write parallelism, etc. We get away with a lot right
now by letting the kernel and buffered I/O sort that out.
We're a *lot* slower due to it.
Don't think you would need writer threads, "just" a bgwriter that
actually works and provides clean buffers unless the machine is
overloaded. I've posted a patch that adds that. On the write side you
then additionally need write combining (doing one write for several
on-disk-consecutive buffers), which isn't trivial to add currently. The
bigger issue than writes is actually doing reads nicely. There's no
readahead anymore, and we'd not have the kernel backstopping our bad
caching decisions anymore.
Greetings,
Andres Freund
On 29 April 2018 at 00:15, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-04-28 08:25:53 -0700, Simon Riggs wrote:
- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.
"Use DirectIO" is roughly the same suggestion as "don't trust Linux filesystems".
I want to emphasize that this is NOT a linux only issue. It's a problem
across a number of operating systems, including linux.
It would be a major admission of defeat for us to take that as our
main route to a solution.
Well, I think we were wrong to not engineer towards DIO. There's just
too many issues with buffered IO to not have a supported path for
DIO. But given that it's unrealistic to do so without major work, and
wouldn't be applicable for all installations (shared_buffer size becomes
critical), I don't think it matters that much for the issue discussed
here.
20/20 hindsight, really. Not much to be done now.
Even with the work you and others have done on shared_buffers
scalability, there's likely still improvement needed there if it
becomes more important to evict buffers into per-device queues, etc,
too.
Personally I'd rather not have to write half the kernel's job because
the kernel doesn't feel like doing it :( . I'd kind of hoped to go in
the other direction if anything, with some kind of pseudo-write op
that let us swap a dirty shared_buffers entry from our shared_buffers
into the OS dirty buffer cache (on Linux at least) and let it handle
writeback, so we reduce double-buffering. Ha! So much for that!
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28 April 2018 at 23:25, Simon Riggs <simon@2ndquadrant.com> wrote:
On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote:
- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.
While I'm a bit concerned about adding user-code before a checkpoint, if
we'd do it as a shell command it seems pretty reasonable. And useful
even without concern for the fsync issue itself. Checking for IO
errors could e.g. also include checking for read errors - it'd not be
unreasonable to not want to complete a checkpoint if there'd been any
media errors.
It seems clear that we need to evaluate our compatibility not just
with an OS, as we do now, but with an OS/filesystem.
Although people have suggested some approaches, I'm more interested in
discovering how we can be certain we got it right.
TBH, we can't be certain, because there are too many failure modes,
some of which we can't really simulate in practical ways, or automated
ways.
But there are definitely steps we can take:
- Test the stack of FS, LVM (if any) etc with the dmsetup 'flakey'
target and a variety of workloads designed to hit errors at various
points. Some form of torture test.
- Almost fill up the device and see what happens if we write() then fsync()
enough to fill it.
- Plug-pull storage and see what happens, especially for multipath/iSCSI/SAN.
Experience with pg_test_fsync shows that it can also be hard to
reliably interpret the results of tests.
Again I'd like to emphasise that this is really only a significant
risk for a few configurations. Yes, it could result in Pg not failing
a checkpoint when it should if, say, your disk has a bad block it
can't repair and remap. But as Andres has pointed out in the past,
those sorts of local storage failure cases tend toward "you're kind of
screwed anyway". It's only a serious concern when I/O errors are part
of the storage's accepted operation, as in multipath with default
settings.
We _definitely_ need to warn multipath users that the defaults are insane.
- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.
"Use DirectIO" is roughly the same suggestion as "don't trust Linux filesystems".
Surprisingly, that seems to be a lot of what's coming out of Linux
developers. Reliable buffered I/O? Why would you try to do that?
I know that's far from a universal position, though, and it sounds
like things were more productive in Andres's discussions at the meet.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28 April 2018 at 08:25, Simon Riggs <simon@2ndquadrant.com> wrote:
On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote:
- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.
While I'm a bit concerned about adding user-code before a checkpoint, if
we'd do it as a shell command it seems pretty reasonable. And useful
even without concern for the fsync issue itself. Checking for IO
errors could e.g. also include checking for read errors - it'd not be
unreasonable to not want to complete a checkpoint if there'd been any
media errors.
It seems clear that we need to evaluate our compatibility not just
with an OS, as we do now, but with an OS/filesystem.
Although people have suggested some approaches, I'm more interested in
discovering how we can be certain we got it right.
And the end result seems to be that PostgreSQL will be forced, in the
short term, to declare certain combinations of OS/filesystem
unsupported, with clear warning sent out to users.
Adding a pre-checkpoint hook encourages people to fix this themselves
without reporting issues, so I initially oppose this until we have a
clearer argument as to why we need it. The answer is not to make this
issue more obscure, but to make it more public.
Thinking some more, I think I understand, but please explain if not.
We need behavior that varies according to OS and filesystem, which
varies per tablespace.
We could have that variable behavior using
* a hook
* a set of GUC parameters that can be set at tablespace level
* a separate config file for each tablespace
My preference would be to avoid a hook.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 28 April 2018 at 09:15, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-04-28 08:25:53 -0700, Simon Riggs wrote:
- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.
"Use DirectIO" is roughly the same suggestion as "don't trust Linux filesystems".
I want to emphasize that this is NOT a linux only issue. It's a problem
across a number of operating systems, including linux.
Yes, of course.
It would be a major admission of defeat for us to take that as our
main route to a solution.
Well, I think we were wrong to not engineer towards DIO. There's just
too many issues with buffered IO to not have a supported path for
DIO. But given that it's unrealistic to do so without major work, and
wouldn't be applicable for all installations (shared_buffer size becomes
critical), I don't think it matters that much for the issue discussed
here.
The people I've spoken to so far have encouraged us to continue
working with the filesystem layer, offering encouragement of our
decision to use filesystems.
There's a lot of people disagreeing with it too.
Specific recent verbal feedback from OpenLDAP was that the project
adopted DIO and found no benefit in doing so, with regret the other
way from having tried.
The care we need to use for any technique is the same.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 29, 2018 at 10:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 28 April 2018 at 09:15, Andres Freund <andres@anarazel.de> wrote:
On 2018-04-28 08:25:53 -0700, Simon Riggs wrote:
The people I've spoken to so far have encouraged us to continue
working with the filesystem layer, offering encouragement of our
decision to use filesystems.
There's a lot of people disagreeing with it too.
Specific recent verbal feedback from OpenLDAP was that the project
adopted DIO and found no benefit in doing so, with regret the other
way from having tried.
I'm not sure if OpenLDAP is really comparable. The big three RDBMSs +
MySQL started like us and eventually switched to direct IO, I guess at
a time when direct IO support matured in OSs and their own IO
scheduling was thought to be superior. I'm pretty sure they did that
because they didn't like wasting RAM on double buffering and had
better ideas about IO scheduling. From some googling this morning:
DB2: The Linux/Unix/Windows edition changed its default to DIO ("NO
FILESYSTEM CACHING") in release 9.5 in 2007 [1], but it can still do
buffered IO if you ask for it.
Oracle: Around the same time or earlier, in the Linux 2.4 era, Oracle
apparently supported direct IO ("FILESYSTEMIO_OPTIONS = DIRECTIO", or
SETALL for DIRECTIO + ASYNCH) on big-iron Unix but didn't yet use it
on Linux [2]. There were some amusing emails from Linus Torvalds on
this topic [3]. I'm not sure what FILESYSTEMIO_OPTIONS's default value
is on each operating system today or when it changed; it's probably
SETALL everywhere by now. I wonder if they stuck with buffered IO for
a time on Linux despite the availability of direct IO because they
thought it was more reliable or more performant.
SQL Server: I couldn't find any evidence that they've even kept the
option to use buffered IO (which must have existed in the ancestral
code base). Can it still do buffered IO? It's a different situation
though, targeting a reduced set of platforms.
MySQL: The default is still buffered ("innodb_flush_method = fsync" as
opposed to "O_DIRECT") but O_DIRECT is supported and widely
recommended, so it sounds like it's usually a win. Maybe not on
smaller systems though?
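For concreteness, switching InnoDB to direct IO is a one-line server
setting. This is a sketch of a my.cnf fragment based on the parameter
named above; check the documentation for your MySQL/MariaDB version
before relying on it, since defaults and accepted values vary:

```ini
# Default on Unix is "fsync" (buffered IO, flushed with fsync()).
# "O_DIRECT" opens data files with O_DIRECT, bypassing the page cache
# for data, while still calling fsync() for filesystem metadata.
[mysqld]
innodb_flush_method = O_DIRECT
```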
On MySQL, there are anecdotal reports of performance suffering on some
systems when you turn on O_DIRECT, however. If that's true, it's
interesting to speculate about why that might be, as it would probably
have applied to us too in early versions. Optimistic explanation: the
kernel's stretchy page cache allows people to get away with a poorly
tuned buffer pool size. Pessimistic explanation: InnoDB's page
reclamation or IO scheduling (asynchronous write-back, write
clustering, read-ahead etc.) is not as good as the OS's, but that
effect is hidden by a suitably powerful disk subsystem with its own
magic caching. Note that its O_DIRECT setting *also* calls fsync() to
flush filesystem meta-data (necessary if the file was extended); I
wonder if that is exposed to write-back error loss.
[1]: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0051304.html
[2]: http://www.ixora.com.au/notes/direct_io.htm
[3]: https://lkml.org/lkml/2002/5/11/58
--
Thomas Munro
http://www.enterprisedb.com
On Mon, Apr 30, 2018 at 11:02 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
MySQL: The default is still buffered
Someone pulled me up on this off-list: the default is buffered (fsync)
on Unix, but it's unbuffered on Windows. That's quite interesting.
https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_flush_method
https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_flush_method
--
Thomas Munro
http://www.enterprisedb.com