Postgres, fsync, and OSs (specifically linux)

Started by Andres Freund · over 7 years ago · 71 messages
#1 Andres Freund
andres@anarazel.de

Hi,

I thought I'd send this separately from [0] as the issue has become more
general than what was mentioned in that thread, and it went off into
various weeds.

I went to LSF/MM 2018 to discuss [0] and related issues. Overall I'd say
it was a very productive discussion. I'll first try to recap the
current situation, updated with knowledge I gained. Secondly I'll try to
discuss the kernel changes that seem to have been agreed upon. Thirdly
I'll try to sum up what postgres needs to change.

== Current Situation ==

The fundamental problem is that postgres assumed that any IO error would
be reported at fsync time, and that the error would be reported until
resolved. That's not true in several operating systems, linux included.

There are various judgement calls leading to the current OS (specifically
linux, but the concerns are similar in other OSs) behaviour:

- By the time IO errors are treated as fatal, it's unlikely that plain
retries attempting to write exactly the same data are going to
succeed. There are retries on several layers. Some cases would be
resolved by overwriting a larger amount (so device level remapping
functionality can mask dead areas), but plain retries aren't going to
get there if they didn't the first time round.
- Retaining all the data necessary for retries would make it quite
possible to turn IO errors on some device into out of memory
errors. This is true to a far lesser degree if only enough information
were to be retained to (re-)report an error, rather than actually
retry the write.
- Continuing to re-report an error after one fsync() failed would make
it hard to recover from that fact. There'd need to be a way to "clear"
a persistent error bit, and that'd obviously be outside of posix.
- Some other databases use direct-IO and thus these paths haven't been
exercised under fire that much.
- Actually marking files as persistently failed would require filesystem
changes, and filesystem metadata IO, far from guaranteed in failure
scenarios.

Before linux v4.13 errors in kernel writeback would be reported at most
once, without a guarantee that that'd happen (IIUC memory pressure could
lead to the relevant information being evicted) - but it was pretty
likely. After v4.13 (see https://lwn.net/Articles/724307/) errors are
reported exactly once to all open file descriptors for a file with an
error - but never for files that have been opened after the error
occurred.

It's worth noting that on linux it's not well defined what contents one
would read after a writeback error. IIUC xfs will mark the pagecache
contents that triggered an error as invalid, triggering a re-read from
the underlying storage (thus either failing or returning old but
persistent contents). Whereas some other filesystems (among them ext4 I
believe) retain the modified contents of the page cache, but mark it
as clean (thereby returning new contents until the page cache contents
are evicted).

Some filesystems (prominently NFS in many configurations) perform an
implicit fsync when closing the file. While postgres checks for an error
of close() and reports it, we don't treat it as fatal. It's worth
noting that by my reading this means that an fsync error at close() will
*not* be re-reported by the time an explicit fsync() is issued. It also
means that we'll not react properly to the possible ENOSPC errors that
may be reported at close() for NFS. At least the latter isn't just the
case in linux.

Proposals for how postgres could deal with this included using syncfs(2)
- but that turns out not to work at all currently, because syncfs()
basically wouldn't return any file-level errors. It'd also imply
superfluously flushing temporary files etc.

The second major type of proposal was using direct-IO. That'd generally
be a desirable feature, but a) would require some significant changes to
postgres to be performant, b) isn't really applicable for the large
percentage of installations that aren't tuned reasonably well, because
at the moment the OS page cache functions as a memory-pressure aware
extension of postgres' page cache.

Another topic brought up in this thread was the handling of ENOSPC
errors that aren't triggered on a filesystem level, but rather are
triggered by thin provisioning. On linux that currently apparently leads
to page cache contents being lost (and errors "eaten") in a lot of
places, including just when doing a write(). In a lot of cases it's
pretty much expected that the file system will just hang or react
unpredictably upon space exhaustion. My reading is that the block-layer
thin provisioning code is still pretty fresh, and should only be used
with great care. The only way to halfway reliably use it appears to be
to change the configuration so space exhaustion blocks until admin
intervention (at least dm-thinp allows that).

There's some clear need to automate some more testing in this area so
that future behaviour changes don't surprise us.

== Proposed Linux Changes ==

- Matthew Wilcox proposed (and posted a patch) that'd partially revert
behaviour to the pre v4.13 world, by *also* reporting errors to
"newer" file-descriptors if the error hasn't previously been
reported. That'd still not guarantee that the error is reported
(memory pressure could evict information without open fd), but in most
situations we'll again get the error in the checkpointer.

This seems to be largely agreed upon. It's unclear whether it'll go into
the stable backports for still-maintained >= v4.13 kernels.

- syncfs() will be fixed so it reports errors properly - that'll likely
require passing it an O_PATH filedescriptor to have space to store the
errseq_t value that allows discerning already reported and new errors.

No patch has appeared yet, but the behaviour seems largely agreed
upon.

- Make per-filesystem error counts available in a uniform (i.e. same for
every supporting fs) manner. Right now it's very hard to figure out
whether errors occurred. There seemed general agreement that exporting
knowledge about such errors is desirable. Quite possibly the syncfs()
fix above will provide the necessary infrastructure. It's unclear as
of yet how the value would be exposed. Per-fs /sys/ entries and an
ioctl on O_PATH fds have been mentioned.

These error counts would not vanish due to memory pressure, and they
can be checked even without knowing which files in a specific
filesystem have been touched (e.g. when just untar-ing something).

There seemed to be fairly widespread agreement that this'd be a good
idea. Much less clear whether somebody would do the work.

- Provide config knobs that allow defining the FS error behaviour in a
consistent way across supported filesystems. XFS currently has various
knobs controlling what happens in case of metadata errors [1] (retry
forever, timeout, return up). It was proposed that this interface be
extended to also deal with data errors, and moved into generic support
code.

While the timeline is unclear, there seemed to be widespread support
for the idea. I believe Dave Chinner indicated that he at least has
plans to generalize the code.

- Stop inodes with unreported errors from being evicted. This will
guarantee that a later fsync (without an open FD) will see the
error. The memory pressure concerns here are lower than with keeping
all the failed pages in memory, and it could be optimized further.

I read some tentative agreement behind this idea, but I think it's by
far the most controversial one.

== Potential Postgres Changes ==

Several operating systems / file systems behave differently than we
expected (see e.g. [2], thanks Thomas). Even the discussed changes to
e.g. linux don't get to where we thought we are. There's obviously also
the question of how to deal with kernels / OSs that have not been
updated.

Changes that appear to be necessary, even for kernels with the issues
addressed:

- Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
retry recovery. While ENODEV (underlying device went away) will be
persistent, it probably makes sense to treat it the same or even just
give up and shut down. One question I see here is whether we just
want to continue crash-recovery cycles, or whether we want to limit
that.

- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

Robert, on IM, wondered whether there'd be a race between some backend
doing a close(), triggering a PANIC, and a checkpoint succeeding. I
don't *think* so, because the error will only happen if there's
outstanding dirty data, and the checkpoint would have flushed that out
if it belonged to the current checkpointing cycle.

- The outstanding fsync request queue isn't persisted properly [3]. This
means that even if the kernel behaved the way we'd expected, we'd not
fail a second checkpoint :(. It's possible that we don't need to deal
with this because we'll henceforth PANIC, but I'd argue we should fix
that regardless. Seems like a time-bomb otherwise (e.g. after moving
to DIO somebody might want to relax the PANIC...).

- It might be a good idea to whitelist expected return codes for write()
and PANIC on ones that we did not expect. E.g. when hitting an EIO we
should probably PANIC, to get back to a known good state. Even though
it's likely that we'd see that error again at fsync().

- Docs.

I think we also need to audit a few codepaths. I'd be surprised if we
PANICed appropriately on all fsync()s, particularly around the SLRUs. We
need to be particularly careful around the WAL handling: it's fairly
likely that there are cases where we'd write out WAL in
one backend and then fsync() in another backend with a file descriptor
that has only been opened *after* the write occurred, which means we
might miss the error entirely.

Then there's the question of how we want to deal with kernels that
haven't been updated with the aforementioned changes. We could say that
we expect decent OS support and declare that we just can't handle this -
given that at least various linux versions, netbsd, openbsd, MacOS just
silently drop errors and we'd need different approaches for dealing with
that, that doesn't seem like an insane approach.

What we could do:

- forward file descriptors from backends to checkpointer (using
SCM_RIGHTS) when marking a segment dirty. That'd require some
optimizations (see [4]) to avoid doing so repeatedly. That'd
guarantee correct behaviour in all linux kernels >= 4.13 (possibly
backported by distributions?), and I think it'd also make it vastly
more likely that errors are reported in earlier kernels.

This should be doable without a noticeable performance impact, I
believe. I don't think it'd be that hard either, but it'd be a bit of
a pain to backport it to all postgres versions, as well as a bit
invasive for that.

The infrastructure this'd likely end up building (a hashtable of open
relfilenodes) would be useful for further things (like caching file
size).

- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.

While I'm a bit concerned about adding user code before a checkpoint, if
we'd do it as a shell command it seems pretty reasonable. And useful
even without concern for the fsync issue itself. Checking for IO
errors could e.g. also include checking for read errors - it'd not be
unreasonable to not want to complete a checkpoint if there'd been any
media errors.

- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration option that allows forcing DIO for the WAL even
if replication is turned on.

- magic

Greetings,

Andres Freund

[0]: https://archives.postgresql.org/message-id/CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com

[1]:
static const struct xfs_error_init xfs_error_meta_init[XFS_ERR_ERRNO_MAX] = {
	{ .name = "default",
	  .max_retries = XFS_ERR_RETRY_FOREVER,
	  .retry_timeout = XFS_ERR_RETRY_FOREVER,
	},
	{ .name = "EIO",
	  .max_retries = XFS_ERR_RETRY_FOREVER,
	  .retry_timeout = XFS_ERR_RETRY_FOREVER,
	},
	{ .name = "ENOSPC",
	  .max_retries = XFS_ERR_RETRY_FOREVER,
	  .retry_timeout = XFS_ERR_RETRY_FOREVER,
	},
	{ .name = "ENODEV",
	  .max_retries = 0,	/* We can't recover from devices disappearing */
	  .retry_timeout = 0,
	},
};

[2]: https://wiki.postgresql.org/wiki/Fsync_Errors
[3]: https://archives.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk
[4]: https://archives.postgresql.org/message-id/20180424180054.inih6bxfspgowjuc@alap3.anarazel.de

#2 Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#1)
Re: Postgres, fsync, and OSs (specifically linux)

On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:

- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

If the no-space or write failures are persistent, as you mentioned
above, what is the point of going into crash recovery --- why not just
shut down? Also, since we can't guarantee that we can write any
persistent state to storage, we have no way of preventing infinite crash
recovery loops, which, based on inconsistent writes, might make things
worse. I think a single panic with no restart is the right solution.

An additional feature we have talked about is running some kind of
notification shell script to inform administrators, similar to
archive_command. We need this too when sync replication fails.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#3 Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#2)
Re: Postgres, fsync, and OSs (specifically linux)

Hi,

On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote:

On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:

- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

If the no-space or write failures are persistent, as you mentioned
above, what is the point of going into crash recovery --- why not just
shut down?

Well, I mentioned that as an alternative in my email. But for one we
don't really have cases where we do that right now, for another we can't
really differentiate between a transient and non-transient state. It's
entirely possible that the admin on the system that ran out of space
fixes things, clearing up the problem.

Also, since we can't guarantee that we can write any persistent state
to storage, we have no way of preventing infinite crash recovery
loops, which, based on inconsistent writes, might make things worse.

How would it make things worse?

An additional feature we have talked about is running some kind of
notification shell script to inform administrators, similar to
archive_command. We need this too when sync replication fails.

To me that seems like a feature independent of this thread.

Greetings,

Andres Freund

#4 Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#3)
Re: Postgres, fsync, and OSs (specifically linux)

On Fri, Apr 27, 2018 at 04:10:43PM -0700, Andres Freund wrote:

Hi,

On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote:

On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:

- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

If the no-space or write failures are persistent, as you mentioned
above, what is the point of going into crash recovery --- why not just
shut down?

Well, I mentioned that as an alternative in my email. But for one we
don't really have cases where we do that right now, for another we can't
really differentiate between a transient and non-transient state. It's
entirely possible that the admin on the system that ran out of space
fixes things, clearing up the problem.

True, but if we get a no-space error, odds are it will not be fixed at
the time we are failing. Wouldn't the administrator check that the
server is still running after they free the space?

Also, since we can't guarantee that we can write any persistent state
to storage, we have no way of preventing infinite crash recovery
loops, which, based on inconsistent writes, might make things worse.

How would it make things worse?

Uh, I can imagine some writes working and some not, and getting things
more inconsistent. I would say at least that we don't know.

An additional feature we have talked about is running some kind of
notification shell script to inform administrators, similar to
archive_command. We need this too when sync replication fails.

To me that seems like a feature independent of this thread.

Well, if we are introducing new panic-and-not-restart behavior, we might
need this new feature.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#5 Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#4)
Re: Postgres, fsync, and OSs (specifically linux)

On 2018-04-27 19:38:30 -0400, Bruce Momjian wrote:

On Fri, Apr 27, 2018 at 04:10:43PM -0700, Andres Freund wrote:

Hi,

On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote:

On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:

- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

If the no-space or write failures are persistent, as you mentioned
above, what is the point of going into crash recovery --- why not just
shut down?

Well, I mentioned that as an alternative in my email. But for one we
don't really have cases where we do that right now, for another we can't
really differentiate between a transient and non-transient state. It's
entirely possible that the admin on the system that ran out of space
fixes things, clearing up the problem.

True, but if we get a no-space error, odds are it will not be fixed at
the time we are failing. Wouldn't the administrator check that the
server is still running after they free the space?

I'd assume it's pretty common that those are separate teams. Given that
we currently don't behave that way for other cases where we *already*
can enter crash-recovery loops I don't think we need to introduce that
here. It's far more common to enter this kind of problem with pg_xlog
filling up the ordinary way. And that can lead to such loops.

Also, since we can't guarantee that we can write any persistent state
to storage, we have no way of preventing infinite crash recovery
loops, which, based on inconsistent writes, might make things worse.

How would it make things worse?

Uh, I can imagine some writes working and some not, and getting things
more inconsistent. I would say at least that we don't know.

Recovery needs to fix that or we're lost anyway. And we'll retry exactly
the same writes each round.

An additional feature we have talked about is running some kind of
notification shell script to inform administrators, similar to
archive_command. We need this too when sync replication fails.

To me that seems like a feature independent of this thread.

Well, if we are introducing new panic-and-not-restart behavior, we might
need this new feature.

I don't see how this follows. It's easier to externally script
notification for the server having died, than doing it for crash
restarts. That's why we have restart_after_crash=false... There might
be some arguments for this type of notification, but I don't think it
should be conflated with the problem here. Nor is it guaranteed that
such a script could do much, given that disks might be failing and such.

Greetings,

Andres Freund

#6 Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#1)
Re: Postgres, fsync, and OSs (specifically linux)

On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote:

Hi,

I thought I'd send this separately from [0] as the issue has become more
general than what was mentioned in that thread, and it went off into
various weeds.

Thanks very much for going and for the great summary.

- Actually marking files as persistently failed would require filesystem
changes, and filesystem metadata IO, far from guaranteed in failure
scenarios.

Yeah, I've avoided suggesting anything like that because it seems way
too likely to lead to knock-on errors.

Like malloc'ing in an OOM path, just don't.

The second major type of proposal was using direct-IO. That'd generally
be a desirable feature, but a) would require some significant changes to
postgres to be performant, b) isn't really applicable for the large
percentage of installations that aren't tuned reasonably well, because
at the moment the OS page cache functions as a memory-pressure aware
extension of postgres' page cache.

Yeah. I've avoided advocating for O_DIRECT because it's a big job
(understatement). We'd need to pay so much more attention to details
of storage layout if we couldn't rely as much on the kernel neatly
organising and queuing everything for us, too.

At the risk of displaying my relative ignorance of direct I/O: Does
O_DIRECT without O_SYNC even provide a strong guarantee that when you
close() the file, all I/O has reliably succeeded? It must've gone
through the FS layer, but don't FSes do various caching and
reorganisation too? Can the same issue arise in other ways unless we
also fsync() before close() or write O_SYNC?

At one point I looked into using AIO instead. But last I looked it was
pretty spectacularly quirky when it comes to reliably flushing, and
outright broken on some versions. In any case, our multiprocessing
model would make tracking completions annoying, likely more so than
the sort of FD handoff games we've discussed.

Another topic brought up in this thread was the handling of ENOSPC
errors that aren't triggered on a filesystem level, but rather are
triggered by thin provisioning. On linux that currently apparently leads
to page cache contents being lost (and errors "eaten") in a lot of
places, including just when doing a write().

... wow.

Is that with lvm-thin?

The thin provisioning I was mainly concerned with is SAN-based thin
provisioning, which looks like a normal iSCSI target or a normal LUN
on a HBA to Linux. Then it starts failing writes with a weird
potentially vendor-specific sense error if it runs out of backing
store. How that's handled likely depends on the specific error, the
driver, which FS you use, etc. In the case I saw, multipath+lvm+xfs,
it resulted in lost writes and fsync() errors being reported once, per
the start of the original thread.

In a lot of cases it's
pretty much expected that the file system will just hang or react
unpredictably upon space exhaustion. My reading is that the block-layer
thin provisioning code is still pretty fresh, and should only be used
with great care. The only way to halfway reliably use it appears to be
to change the configuration so space exhaustion blocks until admin
intervention (at least dm-thinp allows that).

Seems that should go in the OS-specific configuration part of the
docs, along with the advice I gave on the original thread re
configuring multipath no_path_retries.

There's some clear need to automate some more testing in this area so
that future behaviour changes don't surprise us.

We don't routinely test ENOSPC (or memory exhaustion, or crashes) in
PostgreSQL even on bog standard setups.

Like the performance farm discussion, this is something I'd like to
pick up at some point. I'm going to need to talk to the team I work
with regarding time/resources allocation, but I think it's important
that we make such testing more of a routine thing.

- Matthew Wilcox proposed (and posted a patch) that'd partially revert
behaviour to the pre v4.13 world, by *also* reporting errors to
"newer" file-descriptors if the error hasn't previously been
reported. That'd still not guarantee that the error is reported
(memory pressure could evict information without open fd), but in most
situations we'll again get the error in the checkpointer.

This seems to be largely agreed upon. It's unclear whether it'll go into
the stable backports for still-maintained >= v4.13 kernels.

That seems very sensible. In our case we're very unlikely to have some
other unrelated process come in and fsync() our files for us.

I'd want to be sure the report didn't get eaten by sync() or syncfs() though.

- syncfs() will be fixed so it reports errors properly - that'll likely
require passing it an O_PATH filedescriptor to have space to store the
errseq_t value that allows discerning already reported and new errors.

No patch has appeared yet, but the behaviour seems largely agreed
upon.

Good, but as you noted, of limited use to us unless we want to force
users to manage space for temporary and unlogged relations completely
separately.

I wonder if we could convince the kernel to offer a file_sync_mode
xattr to control this? (Hint: I'm already running away in a mylar fire
suit).

- Make per-filesystem error counts available in a uniform (i.e. same for
every supporting fs) manner. Right now it's very hard to figure out
whether errors occurred. There seemed general agreement that exporting
knowledge about such errors is desirable. Quite possibly the syncfs()
fix above will provide the necessary infrastructure. It's unclear as
of yet how the value would be exposed. Per-fs /sys/ entries and an
ioctl on O_PATH fds have been mentioned.

These error counts would not vanish due to memory pressure, and they
can be checked even without knowing which files in a specific
filesystem have been touched (e.g. when just untar-ing something).

There seemed to be fairly widespread agreement that this'd be a good
idea. Much less clear whether somebody would do the work.

- Provide config knobs that allow defining the FS error behaviour in a
consistent way across supported filesystems. XFS currently has various
knobs controlling what happens in case of metadata errors [1] (retry
forever, timeout, return up). It was proposed that this interface be
extended to also deal with data errors, and moved into generic support
code.

While the timeline is unclear, there seemed to be widespread support
for the idea. I believe Dave Chinner indicated that he at least has
plans to generalize the code.

That's great. It sounds like this has revitalised some interest in the
error reporting and might yield some more general cleanups :)

- Stop inodes with unreported errors from being evicted. This will
guarantee that a later fsync (without an open FD) will see the
error. The memory pressure concerns here are lower than with keeping
all the failed pages in memory, and it could be optimized further.

I read some tentative agreement behind this idea, but I think it's by
far the most controversial one.

The main issue there would seem to be cases of whole-FS failure like
the USB-key-yank example. You're going to have to be able to get rid
of them at some point.

- Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
retry recovery. While ENODEV (underlying device went away) will be
persistent, it probably makes sense to treat it the same or even just
give up and shut down. One question I see here is whether we just
want to continue crash-recovery cycles, or whether we want to limit
that.

Right now, we'll panic once, then panic again in redo if the error
persists and give up.

On some systems, and everywhere that Pg is directly user-managed with
pg_ctl, that'll leave Pg down until the operator intervenes. Some init
systems will restart the postmaster automatically. Some will give up
after a few tries. Some will back off retries over time. It depends on
the init system. I'm not sure that's a great outcome.

So rather than giving up if redo fails, we might want to offer a knob
to retry, possibly with pause/backoff. I'm sure people currently
expect PostgreSQL to try to stay up and recover, like it does after a
segfault or most other errors.

Personally I prefer to run Pg with restart_after_crash=off and let the
init system launch a new postmaster, but that's not an option unless
you have a sensible init.

- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

Robert, on IM, wondered whether there'd be a race between some backend
doing a close(), triggering a PANIC, and a checkpoint succeeding. I
don't *think* so, because the error will only happen if there's
outstanding dirty data, and the checkpoint would have flushed that out
if it belonged to the current checkpointing cycle.

Even if it's possible (which it sounds like it probably isn't), it
might also be one of those corner-cases-of-corner-cases where we just
shrug and worry about bigger fish.

- The outstanding fsync request queue isn't persisted properly [3]. This
means that even if the kernel behaved the way we'd expected, we'd not
fail a second checkpoint :(. It's possible that we don't need to deal
with this because we'll henceforth PANIC, but I'd argue we should fix
that regardless. Seems like a time-bomb otherwise (e.g. after moving
to DIO somebody might want to relax the PANIC...).

Huh! Good find. That definitely merits fixing.

- It might be a good idea to whitelist expected return codes for write()
and PANIC on ones that we did not expect. E.g. when hitting an EIO we
should probably PANIC, to get back to a known good state. Even though
it's likely that we'd see that error again at fsync().
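
A sketch of what such a whitelist could look like (the exact
classification below is an assumption about policy, not Postgres's
actual behaviour):

```c
#include <assert.h>
#include <errno.h>

/* Classify write() errnos into "retry", "expected failure" and
 * "unexpected, PANIC". Anything unknown falls through to PANIC so we
 * get back to a known good state via crash recovery. */
enum write_verdict
{
    WRITE_RETRY,            /* transient, just retry the write */
    WRITE_FAIL_EXPECTED,    /* report to the user, e.g. out of space */
    WRITE_PANIC             /* unknown state: trigger crash-restart */
};

static enum write_verdict
classify_write_errno(int err)
{
    switch (err)
    {
        case EINTR:
        case EAGAIN:
            return WRITE_RETRY;
        case ENOSPC:
        case EDQUOT:
            return WRITE_FAIL_EXPECTED;
        case EIO:
        default:
            return WRITE_PANIC;
    }
}
```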

- Docs.

Yep. Especially OS-specific configuration for known dangerous setups
(lvm-thin, multipath), etc. I imagine we can distill a lot of it from
the discussion and simplify a bit.

I think we also need to audit a few codepaths. I'd be surprised if we
PANICed appropriately on all fsyncs(), particularly around the SLRUs.

We _definitely_ do not, see the patch I sent on the other thread.

Then there's the question of how we want to deal with kernels that
haven't been updated with the aforementioned changes. We could say that
we expect decent OS support and declare that we just can't handle this -
given that at least various linux versions, netbsd, openbsd, MacOS just
silently drop errors and we'd need different approaches for dealing with
that, that doesn't seem like an insane approach.

What we could do:

- forward file descriptors from backends to checkpointer (using
SCM_RIGHTS) when marking a segment dirty. That'd require some
optimizations (see [4]) to avoid doing so repeatedly. That'd
guarantee correct behaviour in all linux kernels >= 4.13 (possibly
backported by distributions?), and I think it'd also make it vastly
more likely that errors are reported in earlier kernels.

It'd be interesting to see if other platforms that support fd passing
will give us the desired behaviour too. But even if it only helps on
Linux, that's a huge majority of the PostgreSQL deployments these
days.
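
For reference, the SCM_RIGHTS handoff mentioned above looks roughly
like this (a standalone illustration, not Postgres code; send_fd /
recv_fd are hypothetical helper names):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Alignment helper for the control-message buffer. */
union cmsg_buf
{
    char            buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr  align;
};

/* Pass fd over a Unix-domain socket; the receiver gets a new
 * descriptor referring to the same open file description, so
 * fd-local error state travels with it. */
static int
send_fd(int sock, int fd)
{
    struct msghdr   msg = {0};
    struct iovec    iov;
    union cmsg_buf  u;
    struct cmsghdr *cmsg;
    char            byte = 'F';

    iov.iov_base = &byte;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

static int
recv_fd(int sock)
{
    struct msghdr   msg = {0};
    struct iovec    iov;
    union cmsg_buf  u;
    struct cmsghdr *cmsg;
    char            byte;
    int             fd;

    iov.iov_base = &byte;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```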

- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.

I looked into using trace event file descriptors for this, btw, but
we'd need CAP_SYS_ADMIN to create one that captured events for other
processes. Plus filtering the events to find only events for the files
/ file systems of interest would be far from trivial. And I don't know
what guarantees we have about when events are delivered.

I'd love to be able to use inotify for this, but again, that'd only be
a new-kernels thing since it'd need an inotify extension to report I/O
errors.

Presumably mostly this check would land up looking at dmesg.

I'm not convinced it'd get widely deployed and widely used, or that
it'd be used correctly when people tried to use it. Look at the
hideous mess that most backup/standby creation scripts,
archive_command scripts, etc are.
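
If we did go the shell-command route, the check could be as simple as
the sketch below. The error patterns are assumptions drawn from common
block-layer/filesystem messages; a real hook would need per-OS,
per-filesystem tuning (and read access to the kernel log).

```shell
# Hypothetical pre-checkpoint hook: fail (non-zero return) if the
# given kernel log excerpt contains I/O error patterns.
check_kernel_log() {
    if grep -Eq 'Buffer I/O error|blk_update_request: I/O error|EXT4-fs error|XFS .* metadata I/O error' "$1"; then
        echo "kernel log reports I/O errors; refusing to complete checkpoint" >&2
        return 1
    fi
    return 0
}
```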

- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.

Seems like a long term goal, but you've noted elsewhere that doing it
well would be hard. I suspect we'd need writer threads, we'd need to
know more about the underlying FS/storage layout to make better
decisions about write parallelism, etc. We get away with a lot right
now by letting the kernel and buffered I/O sort that out.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#5)
Re: Postgres, fsync, and OSs (specifically linux)

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

On 2018-04-27 19:38:30 -0400, Bruce Momjian wrote:

On Fri, Apr 27, 2018 at 04:10:43PM -0700, Andres Freund wrote:

On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote:

On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote:

- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

If the no-space or write failures are persistent, as you mentioned
above, what is the point of going into crash recovery --- why not just
shut down?

Well, I mentioned that as an alternative in my email. But for one we
don't really have cases where we do that right now, for another we can't
really differentiate between a transient and non-transient state. It's
entirely possible that the admin on the system that ran out of space
fixes things, clearing up the problem.

True, but if we get a no-space error, odds are it will not be fixed at
the time we are failing. Wouldn't the administrator check that the
server is still running after they free the space?

I'd assume it's pretty common that those are separate teams. Given that
we currently don't behave that way for other cases where we *already*
can enter crash-recovery loops I don't think we need to introduce that
here. It's far more common to enter this kind of problem with pg_xlog
filling up the ordinary way. And that can lead to such loops.

When we crash-restart, we also go through and clean things up some, no?
Seems like that gives us the potential to end up fixing things ourselves
and allowing the crash-restart to succeed.

Consider unlogged tables, temporary tables, on-disk sorts, etc. It's
entirely common for a bad query to run the system out of disk space (but
have a write of a regular table be what discovers the out-of-space
problem...) and if we crash-restart properly then we'd hopefully clean
things out, freeing up space, and allowing us to come back up.

Now, of course, ideally admins would set up temp tablespaces and
segregate WAL onto its own filesystem, etc, but...

Thanks!

Stephen

#8Stephen Frost
sfrost@snowman.net
In reply to: Craig Ringer (#6)
Re: Postgres, fsync, and OSs (specifically linux)

Greetings,

* Craig Ringer (craig@2ndquadrant.com) wrote:

On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote:

- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.

I looked into using trace event file descriptors for this, btw, but
we'd need CAP_SYS_ADMIN to create one that captured events for other
processes. Plus filtering the events to find only events for the files
/ file systems of interest would be far from trivial. And I don't know
what guarantees we have about when events are delivered.

I'd love to be able to use inotify for this, but again, that'd only be
a new-kernels thing since it'd need an inotify extension to report I/O
errors.

Presumably mostly this check would land up looking at dmesg.

I'm not convinced it'd get widely deployed and widely used, or that
it'd be used correctly when people tried to use it. Look at the
hideous mess that most backup/standby creation scripts,
archive_command scripts, etc are.

Agree with more-or-less everything you've said here, but a big +1 on
this. If we do end up going down this route we have *got* to provide
scripts which we know work and have been tested and are well maintained
on the popular OS's for the popular filesystems and make it clear that
we've tested those and not others. We definitely shouldn't put
something in our docs that is effectively an example of the interface
but not an actual command that anyone should be using.

Thanks!

Stephen

#9Simon Riggs
simon@2ndquadrant.com
In reply to: Andres Freund (#1)
Re: Postgres, fsync, and OSs (specifically linux)

On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote:

- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.

While I'm a bit concerned adding user-code before a checkpoint, if
we'd do it as a shell command it seems pretty reasonable. And useful
even without concern for the fsync issue itself. Checking for IO
errors could e.g. also include checking for read errors - it'd not be
unreasonable to not want to complete a checkpoint if there'd been any
media errors.

It seems clear that we need to evaluate our compatibility not just
with an OS, as we do now, but with an OS/filesystem.

Although people have suggested some approaches, I'm more interested in
discovering how we can be certain we got it right.

And the end result seems to be that PostgreSQL will be forced, in the
short term, to declare certain combinations of OS/filesystem
unsupported, with clear warning sent out to users.

Adding a pre-checkpoint hook encourages people to fix this themselves
without reporting issues, so I initially oppose this until we have a
clearer argument as to why we need it. The answer is not to make this
issue more obscure, but to make it more public.

- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.

"Use DirectIO" is roughly same suggestion as "don't trust Linux filesystems".

It would be a major admission of defeat for us to take that as our
main route to a solution.

The people I've spoken to so far have encouraged us to continue
working with the filesystem layer, offering encouragement of our
decision to use filesystems.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Michael Banck
michael.banck@credativ.de
In reply to: Stephen Frost (#8)
Re: Postgres, fsync, and OSs (specifically linux)

Hi,

On Sat, Apr 28, 2018 at 11:21:20AM -0400, Stephen Frost wrote:

* Craig Ringer (craig@2ndquadrant.com) wrote:

On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote:

- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.

I looked into using trace event file descriptors for this, btw, but
we'd need CAP_SYS_ADMIN to create one that captured events for other
processes. Plus filtering the events to find only events for the files
/ file systems of interest would be far from trivial. And I don't know
what guarantees we have about when events are delivered.

I'd love to be able to use inotify for this, but again, that'd only be
a new-kernels thing since it'd need an inotify extension to report I/O
errors.

Presumably mostly this check would land up looking at dmesg.

I'm not convinced it'd get widely deployed and widely used, or that
it'd be used correctly when people tried to use it. Look at the
hideous mess that most backup/standby creation scripts,
archive_command scripts, etc are.

Agree with more-or-less everything you've said here, but a big +1 on
this. If we do end up going down this route we have *got* to provide
scripts which we know work and have been tested and are well maintained
on the popular OS's for the popular filesystems and make it clear that
we've tested those and not others. We definitely shouldn't put
something in our docs that is effectively an example of the interface
but not an actual command that anyone should be using.

This dmesg-checking has been mentioned several times now, but IME
enterprise distributions (or server ops teams?) seem to tighten access
to dmesg and /var/log to non-root users, including postgres.

Well, or just vanilla Debian stable apparently:

postgres@fock:~$ dmesg
dmesg: read kernel buffer failed: Operation not permitted

Is it really a useful expectation that the postgres user will be able to
trawl system logs for I/O errors? Or are we expecting the sysadmins (in
case they are distinct from the DBAs) to setup sudo and/or relax
permissions for this everywhere? We should document this requirement
properly at least then.

The netlink thing from Google that Ted Ts'o mentioned would probably
work around that, but if that is opened up it would not be deployed
anytime soon either.

Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

#11Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#7)
Re: Postgres, fsync, and OSs (specifically linux)

Hi,

On 2018-04-28 11:10:54 -0400, Stephen Frost wrote:

When we crash-restart, we also go through and clean things up some, no?
Seems like that gives us the potential to end up fixing things ourselves
and allowing the crash-restart to succeed.

Sure, there's the potential for that. But it's quite possible to be
missing a lot of free space over NFS (this really isn't much of an issue
for local FS, at least not on linux) in a workload with rapidly
expanding space usage. And even if you recover, you could just hit the
issue again shortly afterwards.

Greetings,

Andres Freund

#12Andres Freund
andres@anarazel.de
In reply to: Michael Banck (#10)
Re: Postgres, fsync, and OSs (specifically linux)

Hi,

On 2018-04-28 17:35:48 +0200, Michael Banck wrote:

This dmesg-checking has been mentioned several times now, but IME
enterprise distributions (or server ops teams?) seem to tighten access
to dmesg and /var/log to non-root users, including postgres.

Well, or just vanilla Debian stable apparently:

postgres@fock:~$ dmesg
dmesg: read kernel buffer failed: Operation not permitted

Is it really a useful expectation that the postgres user will be able to
trawl system logs for I/O errors? Or are we expecting the sysadmins (in
case they are distinct from the DBAs) to setup sudo and/or relax
permissions for this everywhere? We should document this requirement
properly at least then.

I'm not a huge fan of this approach, but yes, that'd be necessary. It's
not that problematic to have to change /dev/kmsg permissions imo. Adding
a read group / acl seems quite doable.

The netlink thing from Google that Ted Ts'o mentioned would probably
work around that, but if that is opened up it would not be deployed
anytime soon either.

Yea, that seems irrelevant for now.

Greetings,

Andres Freund

#13Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#9)
Re: Postgres, fsync, and OSs (specifically linux)

Hi,

On 2018-04-28 08:25:53 -0700, Simon Riggs wrote:

- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.

"Use DirectIO" is roughly same suggestion as "don't trust Linux filesystems".

I want to emphasize that this is NOT a linux only issue. It's a problem
across a number of operating systems, including linux.

It would be a major admission of defeat for us to take that as our
main route to a solution.

Well, I think we were wrong to not engineer towards DIO. There's just
too many issues with buffered IO to not have a supported path for
DIO. But given that it's unrealistic to do so without major work, and
wouldn't be applicable for all installations (shared_buffer size becomes
critical), I don't think it matters that much for the issue discussed
here.

The people I've spoken to so far have encouraged us to continue
working with the filesystem layer, offering encouragement of our
decision to use filesystems.

There's a lot of people disagreeing with it too.

Greetings,

Andres Freund

#14Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#6)
Re: Postgres, fsync, and OSs (specifically linux)

Hi,

On 2018-04-28 20:00:25 +0800, Craig Ringer wrote:

On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote:

The second major type of proposal was using direct-IO. That'd generally
be a desirable feature, but a) would require some significant changes to
postgres to be performant, b) isn't really applicable for the large
percentage of installations that aren't tuned reasonably well, because
at the moment the OS page cache functions as a memory-pressure aware
extension of postgres' page cache.

Yeah. I've avoided advocating for O_DIRECT because it's a big job
(understatement). We'd need to pay so much more attention to details
of storage layout if we couldn't rely as much on the kernel neatly
organising and queuing everything for us, too.

At the risk of displaying my relative ignorance of direct I/O: Does
O_DIRECT without O_SYNC even provide a strong guarantee that when you
close() the file, all I/O has reliably succeeded? It must've gone
through the FS layer, but don't FSes do various caching and
reorganisation too? Can the same issue arise in other ways unless we
also fsync() before close() or write O_SYNC?

No, not really. There's generally two categories of IO here: Metadata IO
and data IO. The filesystem's metadata IO a) has a lot more error
checking (including things like remount-ro, stalling the filesystem on
errors etc), b) isn't direct IO itself. For some filesystem metadata
operations you'll still need fsyncs, but the *data* is flushed if you
use DIO. I'd personally use O_DSYNC | O_DIRECT, and have the metadata
operations guaranteed by fsyncs. You'd need the current fsyncs for
renaming, and probably some fsyncs for file extensions. The latter to
make sure the filesystem has written the metadata change.
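
That write path could look roughly like the sketch below (an
assumption-laden illustration: the 4096-byte alignment covers common
logical block sizes, and it falls back to buffered O_DSYNC where the
filesystem rejects O_DIRECT, e.g. tmpfs):

```c
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096

/* Open for the O_DSYNC | O_DIRECT data path discussed above; fall
 * back to plain O_DSYNC if the filesystem doesn't support O_DIRECT. */
static int
open_datasync_direct(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC | O_DIRECT, 0600);

    if (fd < 0)
        fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
    return fd;
}

/* O_DIRECT requires buffer, offset and length alignment, hence the
 * posix_memalign()ed bounce buffer. */
static ssize_t
write_aligned(int fd, const char *data, size_t len)
{
    void    *buf;
    ssize_t  rc;

    if (posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN) != 0)
        return -1;
    memset(buf, 0, DIO_ALIGN);
    memcpy(buf, data, len < DIO_ALIGN ? len : DIO_ALIGN);
    rc = write(fd, buf, DIO_ALIGN);
    free(buf);
    return rc;
}
```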

At one point I looked into using AIO instead. But last I looked it was
pretty spectacularly quirky when it comes to reliably flushing, and
outright broken on some versions. In any case, our multiprocessing
model would make tracking completions annoying, likely more so than
the sort of FD handoff games we've discussed.

AIO pretty much only works sensibly with DIO.

Another topic brought up in this thread was the handling of ENOSPC
errors that aren't triggered on a filesystem level, but rather are
triggered by thin provisioning. On linux that currently apparently leads
to page cache contents being lost (and errors "eaten") in a lot of
places, including just when doing a write().

... wow.

Is that with lvm-thin?

I think both dm and lvm (I typed llvm thrice) based thin
provisioning. The FS code basically didn't expect ENOSPC being returned
from storage, but suddenly the storage layer started returning it...

The thin provisioning I was mainly concerned with is SAN-based thin
provisioning, which looks like a normal iSCSI target or a normal LUN
on a HBA to Linux. Then it starts failing writes with a weird
potentially vendor-specific sense error if it runs out of backing
store. How that's handled likely depends on the specific error, the
driver, which FS you use, etc. In the case I saw, multipath+lvm+xfs,
it resulted in lost writes and fsync() errors being reported once, per
the start of the original thread.

I think the concerns are largely the same for that. You'll have to
configure the SAN to block in that case.

- Matthew Wilcox proposed (and posted a patch) that'd partially revert
behaviour to the pre v4.13 world, by *also* reporting errors to
"newer" file-descriptors if the error hasn't previously been
reported. That'd still not guarantee that the error is reported
(memory pressure could evict information without open fd), but in most
situations we'll again get the error in the checkpointer.

This seems largely be agreed upon. It's unclear whether it'll go into
the stable backports for still-maintained >= v4.13 kernels.

That seems very sensible. In our case we're very unlikely to have some
other unrelated process come in and fsync() our files for us.

I'd want to be sure the report didn't get eaten by sync() or syncfs() though.

It doesn't. Basically every fd has an errseq_t value copied into it at
open.
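
A toy model of that mechanism (a deliberate simplification of the
kernel's errseq_t, just to show the per-fd sampling semantics; all
names here are hypothetical):

```c
#include <assert.h>

typedef unsigned errseq_t;

struct inode_model { errseq_t wb_err; };                     /* per inode */
struct file_model  { const struct inode_model *inode;        /* per fd */
                     errseq_t sampled; };

/* Each fd samples the inode's error sequence at open() time. */
static void
file_open(struct file_model *f, const struct inode_model *i)
{
    f->inode = i;
    f->sampled = i->wb_err;
}

/* A writeback failure advances the inode's sequence. (The kernel also
 * records which errno and a "seen" bit; elided here.) */
static void
writeback_error(struct inode_model *i)
{
    i->wb_err++;
}

/* fsync() reports an error exactly once per fd per new error: the fd
 * only sees sequence advances that happened after its own sample. */
static int
file_fsync(struct file_model *f)
{
    if (f->sampled != f->inode->wb_err)
    {
        f->sampled = f->inode->wb_err;  /* mark seen for this fd */
        return -1;                      /* would set errno = EIO */
    }
    return 0;
}
```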

- syncfs() will be fixed so it reports errors properly - that'll likely
require passing it an O_PATH filedescriptor to have space to store the
errseq_t value that allows discerning already reported and new errors.

No patch has appeared yet, but the behaviour seems largely agreed
upon.

Good, but as you noted, of limited use to us unless we want to force
users to manage space for temporary and unlogged relations completely
separately.

Well, I think it'd still be ok as a backstop if it had decent error
semantics. We don't checkpoint that often, and doing the syncing via
syncfs() is considerably more efficient than individual fsync()s. But
given it's currently buggy that tradeoff is moot.

I wonder if we could convince the kernel to offer a file_sync_mode
xattr to control this? (Hint: I'm already running away in a mylar fire
suit).

Err. I am fairly sure you're not going to get anywhere with that. Given
we're concerned about existing kernels, I doubt it'd help us much anyway.

- Stop inodes with unreported errors from being evicted. This will
guarantee that a later fsync (without an open FD) will see the
error. The memory pressure concerns here are lower than with keeping
all the failed pages in memory, and it could be optimized further.

I read some tentative agreement behind this idea, but I think it's the
by far most controversial one.

The main issue there would seem to be cases of whole-FS failure like
the USB-key-yank example. You're going to have to be able to get rid
of them at some point.

It's not actually a real problem (despite initially being brought up a
number of times by kernel people). There's a separate error for that
(ENODEV), and filesystems already handle it differently. Once that's
returned, fsyncs() etc are just shortcut to ENODEV.

What we could do:

- forward file descriptors from backends to checkpointer (using
SCM_RIGHTS) when marking a segment dirty. That'd require some
optimizations (see [4]) to avoid doing so repeatedly. That'd
guarantee correct behaviour in all linux kernels >= 4.13 (possibly
backported by distributions?), and I think it'd also make it vastly
more likely that errors are reported in earlier kernels.

It'd be interesting to see if other platforms that support fd passing
will give us the desired behaviour too. But even if it only helps on
Linux, that's a huge majority of the PostgreSQL deployments these
days.

Afaict it'd not help all of them. It does provide guarantees against the
inode being evicted on pretty much all OSs, but not all of them have an
error counter there...

- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.

Seems like a long term goal, but you've noted elsewhere that doing it
well would be hard. I suspect we'd need writer threads, we'd need to
know more about the underlying FS/storage layout to make better
decisions about write parallelism, etc. We get away with a lot right
now by letting the kernel and buffered I/O sort that out.

We're a *lot* slower due to it.

Don't think you would need writer threads, "just" a bgwriter that
actually works and provides clean buffers unless the machine is
overloaded. I've posted a patch that adds that. On the write side you
then additionally need write combining (doing one write for several
on-disk-consecutive buffers), which isn't trivial to add currently. The
bigger issue than writes is actually doing reads nicely. There's no
readahead anymore, and we'd not have the kernel backstopping our bad
caching decisions anymore.

Greetings,

Andres Freund

#15Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#13)
Re: Postgres, fsync, and OSs (specifically linux)

On 29 April 2018 at 00:15, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-04-28 08:25:53 -0700, Simon Riggs wrote:

- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.

"Use DirectIO" is roughly same suggestion as "don't trust Linux filesystems".

I want to emphasize that this is NOT a linux only issue. It's a problem
across a number of operating systems, including linux.

It would be a major admission of defeat for us to take that as our
main route to a solution.

Well, I think we were wrong to not engineer towards DIO. There's just
too many issues with buffered IO to not have a supported path for
DIO. But given that it's unrealistic to do so without major work, and
wouldn't be applicable for all installations (shared_buffer size becomes
critical), I don't think it matters that much for the issue discussed
here.

20/20 hindsight, really. Not much to be done now.

Even with the work you and others have done on shared_buffers
scalability, there's likely still improvement needed there if it
becomes more important to evict buffers into per-device queues, etc,
too.

Personally I'd rather not have to write half the kernel's job because
the kernel doesn't feel like doing it :( . I'd kind of hoped to go in
the other direction if anything, with some kind of pseudo-write op
that let us swap a dirty shared_buffers entry from our shared_buffers
into the OS dirty buffer cache (on Linux at least) and let it handle
writeback, so we reduce double-buffering. Ha! So much for that!

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#16Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#9)
Re: Postgres, fsync, and OSs (specifically linux)

On 28 April 2018 at 23:25, Simon Riggs <simon@2ndquadrant.com> wrote:

On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote:

- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.

While I'm a bit concerned adding user-code before a checkpoint, if
we'd do it as a shell command it seems pretty reasonable. And useful
even without concern for the fsync issue itself. Checking for IO
errors could e.g. also include checking for read errors - it'd not be
unreasonable to not want to complete a checkpoint if there'd been any
media errors.

It seems clear that we need to evaluate our compatibility not just
with an OS, as we do now, but with an OS/filesystem.

Although people have suggested some approaches, I'm more interested in
discovering how we can be certain we got it right.

TBH, we can't be certain, because there are too many failure modes,
some of which we can't really simulate in practical ways, or automated
ways.

But there are definitely steps we can take:

- Test the stack of FS, LVM (if any) etc with the dmsetup 'flakey'
target and a variety of workloads designed to hit errors at various
points. Some form of torture test.

- Nearly fill the device, then see what happens when we write() and
fsync() enough to fill it completely.

- Plug-pull storage and see what happens, especially for multipath/iSCSI/SAN.

Experience with pg_test_fsync shows that it can also be hard to
reliably interpret the results of tests.

Again I'd like to emphasise that this is really only a significant
risk for a few configurations. Yes, it could result in Pg not failing
a checkpoint when it should if, say, your disk has a bad block it
can't repair and remap. But as Andres has pointed out in the past,
those sorts of local storage failure cases tend toward "you're kind of
screwed anyway". It's only a serious concern when I/O errors are part
of the storage's accepted operation, as in multipath with default
settings.

We _definitely_ need to warn multipath users that the defaults are insane.

- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows to force DIO to be enabled for WAL
even if replication is turned on.

"Use DirectIO" is roughly same suggestion as "don't trust Linux filesystems".

Surprisingly, that seems to be a lot of what's coming out of Linux
developers. Reliable buffered I/O? Why would you try to do that?

I know that's far from a universal position, though, and it sounds
like things were more productive in Andres's discussions at the meet.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#9)
Re: Postgres, fsync, and OSs (specifically linux)

On 28 April 2018 at 08:25, Simon Riggs <simon@2ndquadrant.com> wrote:

On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote:

- Add a pre-checkpoint hook that checks for filesystem errors *after*
fsyncing all the files, but *before* logging the checkpoint completion
record. Operating systems, filesystems, etc. all log the error format
differently, but for larger installations it'd not be too hard to
write code that checks their specific configuration.

While I'm a bit concerned adding user-code before a checkpoint, if
we'd do it as a shell command it seems pretty reasonable. And useful
even without concern for the fsync issue itself. Checking for IO
errors could e.g. also include checking for read errors - it'd not be
unreasonable to not want to complete a checkpoint if there'd been any
media errors.

It seems clear that we need to evaluate our compatibility not just
with an OS, as we do now, but with an OS/filesystem.

Although people have suggested some approaches, I'm more interested in
discovering how we can be certain we got it right.

And the end result seems to be that PostgreSQL will be forced, in the
short term, to declare certain combinations of OS/filesystem
unsupported, with clear warning sent out to users.

Adding a pre-checkpoint hook encourages people to fix this themselves
without reporting issues, so I initially oppose this until we have a
clearer argument as to why we need it. The answer is not to make this
issue more obscure, but to make it more public.

Thinking some more, I think I understand, but please explain if not.

We need behavior that varies according to OS and filesystem, which
varies per tablespace.

We could have that variable behavior using

* a hook

* a set of GUC parameters that can be set at tablespace level

* a separate config file for each tablespace

My preference would be to avoid a hook.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Simon Riggs
simon@2ndquadrant.com
In reply to: Andres Freund (#13)
Re: Postgres, fsync, and OSs (specifically linux)

On 28 April 2018 at 09:15, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-04-28 08:25:53 -0700, Simon Riggs wrote:

- Use direct IO. Due to architectural performance issues in PG and the
fact that it'd not be applicable for all installations I don't think
this is a reasonable fix for the issue presented here. Although it's
independently something we should work on. It might be worthwhile to
provide a configuration that allows forcing DIO to be enabled for WAL
even if replication is turned on.

"Use DirectIO" is roughly the same suggestion as "don't trust Linux filesystems".

I want to emphasize that this is NOT a linux only issue. It's a problem
across a number of operating systems, including linux.

Yes, of course.

It would be a major admission of defeat for us to take that as our
main route to a solution.

Well, I think we were wrong to not engineer towards DIO. There's just
too many issues with buffered IO to not have a supported path for
DIO. But given that it's unrealistic to do so without major work, and
wouldn't be applicable for all installations (shared_buffer size becomes
critical), I don't think it matters that much for the issue discussed
here.

The people I've spoken to so far have encouraged us to continue
working with the filesystem layer, offering encouragement of our
decision to use filesystems.

There's a lot of people disagreeing with it too.

Specific recent verbal feedback from OpenLDAP was that the project
adopted DIO and found no benefit in doing so, with regret the other
way from having tried.

The care we need to use for any technique is the same.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#19Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#18)
Re: Postgres, fsync, and OSs (specifically linux)

On Sun, Apr 29, 2018 at 10:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 28 April 2018 at 09:15, Andres Freund <andres@anarazel.de> wrote:

On 2018-04-28 08:25:53 -0700, Simon Riggs wrote:

The people I've spoken to so far have encouraged us to continue
working with the filesystem layer, offering encouragement of our
decision to use filesystems.

There's a lot of people disagreeing with it too.

Specific recent verbal feedback from OpenLDAP was that the project
adopted DIO and found no benefit in doing so, with regret the other
way from having tried.

I'm not sure if OpenLDAP is really comparable. The big three RDBMSs +
MySQL started like us and eventually switched to direct IO, I guess at
a time when direct IO support matured in OSs and their own IO
scheduling was thought to be superior. I'm pretty sure they did that
because they didn't like wasting RAM on double buffering and had
better ideas about IO scheduling. From some googling this morning:

DB2: The Linux/Unix/Windows edition changed its default to DIO ("NO
FILESYSTEM CACHING") in release 9.5 in 2007[1], but it can still do
buffered IO if you ask for it.

Oracle: Around the same time or earlier, in the Linux 2.4 era, Oracle
apparently supported direct IO ("FILESYSTEMIO_OPTIONS = DIRECTIO" (or
SETALL for DIRECTIO + ASYNCH)) on big iron Unix but didn't yet use it
on Linux[2]. There were some amusing emails from Linus Torvalds on
this topic[3]. I'm not sure what FILESYSTEMIO_OPTIONS's default value
is on each operating system today or when it changed; it's probably
SETALL everywhere by now? I wonder if they stuck with buffered IO for
a time on Linux despite the availability of direct IO because they
thought it was more reliable or more performant.

SQL Server: I couldn't find any evidence that they've even kept the
option to use buffered IO (which must have existed in the ancestral
code base). Can it? It's a different situation though, targeting a
reduced set of platforms.

MySQL: The default is still buffered ("innodb_flush_method = fsync" as
opposed to "O_DIRECT") but O_DIRECT is supported and widely
recommended, so it sounds like it's usually a win. Maybe not on
smaller systems though?

On MySQL, however, there are anecdotal reports of performance suffering
on some systems when you turn on O_DIRECT. If that's true, it's
interesting to speculate about why that might be, as it would probably
have applied to us too in early versions (optimistic explanation: the
kernel's stretchy page cache allows people to get away with a poorly
tuned buffer pool size? pessimistic explanation: the database's own
page reclamation or IO scheduling (asynchronous write-back, write
clustering, read-ahead etc.) is not as good as the OS's, but that
effect is hidden by a suitably powerful disk subsystem with its own
magic caching?) Note that its O_DIRECT setting *also* calls fsync() to
flush filesystem meta-data (necessary if the file was extended); I
wonder if that is exposed to write-back error loss.

[1]: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0051304.html
[2]: http://www.ixora.com.au/notes/direct_io.htm
[3]: https://lkml.org/lkml/2002/5/11/58

--
Thomas Munro
http://www.enterprisedb.com

#20Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#19)
Re: Postgres, fsync, and OSs (specifically linux)

On Mon, Apr 30, 2018 at 11:02 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

MySQL: The default is still buffered

Someone pulled me up on this off-list: the default is buffered (fsync)
on Unix, but it's unbuffered on Windows. That's quite interesting.

https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_flush_method
https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_flush_method

--
Thomas Munro
http://www.enterprisedb.com

#21Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Craig Ringer (#16)
Re: Postgres, fsync, and OSs (specifically linux)

On Sun, Apr 29, 2018 at 1:58 PM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 28 April 2018 at 23:25, Simon Riggs <simon@2ndquadrant.com> wrote:

On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote:

While I'm a bit concerned adding user-code before a checkpoint, if
we'd do it as a shell command it seems pretty reasonable. And useful
even without concern for the fsync issue itself. Checking for IO
errors could e.g. also include checking for read errors - it'd not be
unreasonable to not want to complete a checkpoint if there'd been any
media errors.

It seems clear that we need to evaluate our compatibility not just
with an OS, as we do now, but with an OS/filesystem.

Although people have suggested some approaches, I'm more interested in
discovering how we can be certain we got it right.

TBH, we can't be certain, because there are too many failure modes,
some of which we can't really simulate in practical ways, or automated
ways.

+1

Testing is good, but unless you have a categorical statement from the
relevant documentation or kernel team or you have the source code, I'm
not sure how you can ever really be sure about this. I think we have
a fair idea now what several open kernels do, but we still haven't got
a clue about Windows, AIX, HPUX and Solaris and we only have half the
answer for Illumos, and no "negative" test result can prove that they
can't throw away write-back errors or data.

Considering the variety in interpretation and liberties taken, I
wonder if fsync() is underspecified and someone should file an issue
over at http://www.opengroup.org/austin/ about that.

--
Thomas Munro
http://www.enterprisedb.com

#22Craig Ringer
craig@2ndquadrant.com
In reply to: Thomas Munro (#21)
Re: Postgres, fsync, and OSs (specifically linux)

On 30 April 2018 at 09:09, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Considering the variety in interpretation and liberties taken, I
wonder if fsync() is underspecified and someone should file an issue
over at http://www.opengroup.org/austin/ about that.

All it's going to achieve is adding an "is implementation-defined"
caveat, but that's at least a bit of a heads-up.

I filed patches for Linux man-pages ages ago. I'll update them and
post to LKML; apparently bugzilla has a lot of spam and many people
ignore notifications, so they might just bitrot forever otherwise.

Meanwhile, do we know whether, on Linux 4.13+, if we get a buffered write
error due to dirty writeback before we close() a file we don't
fsync(), we'll get the error on close()?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#23Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#22)
Re: Postgres, fsync, and OSs (specifically linux)

On 2018-04-30 10:14:23 +0800, Craig Ringer wrote:

Meanwhile, do we know whether, on Linux 4.13+, if we get a buffered write
error due to dirty writeback before we close() a file we don't
fsync(), we'll get the error on close()?

Not quite sure what you're getting at with "a file we don't fsync" - if
we don't, we don't care about durability anyway, no? Or do you mean
where we fsync in a different process?

Either way, the answer is mostly no: On NFS et al where close() implies
an fsync you'll get the error at that time, otherwise you'll get it at
the next fsync().

Greetings,

Andres Freund

#24Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#23)
Re: Postgres, fsync, and OSs (specifically linux)

Not quite sure what you're getting at with "a file we don't fsync" - if
we don't, we don't care about durability anyway, no? Or do you mean
where we fsync in a different process?

Right.

Either way, the answer is mostly no: On NFS et al where close() implies
an fsync you'll get the error at that time, otherwise you'll get it at
the next fsync().

Thanks.

The reason I ask is that if we got notified of already-detected
writeback errors (on 4.13+) on close() too, it'd narrow the window a
little for problems, since normal backends could PANIC if close() of a
persistent file raised EIO. Otherwise we're less likely to see the
error, since the checkpointer won't see it - it happened before the
checkpointer open()ed the file. It'd still be no help for dirty
writeback that happens after we close() in a user backend / the
bgwriter and before we re-open(), but it'd be nice if the kernel would
tell us on close() if it knows of a writeback error.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#25Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#24)
Re: Postgres, fsync, and OSs (specifically linux)

Hrm, something else that just came up. On 9.6+ we use sync_file_range.
It's surely going to eat errors:

	rc = sync_file_range(fd, offset, nbytes,
						 SYNC_FILE_RANGE_WRITE);

	/* don't error out, this is just a performance optimization */
	if (rc != 0)
	{
		ereport(WARNING,
				(errcode_for_file_access(),
				 errmsg("could not flush dirty data: %m")));
	}

so that has to panic too.

I'm very suspicious about the safety of the msync() path too.

I'll post an update to my PANIC-everywhere patch that add these cases.

#26Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#25)
Re: Postgres, fsync, and OSs (specifically linux)

On 2018-04-30 13:03:24 +0800, Craig Ringer wrote:

Hrm, something else that just came up. On 9.6+ we use sync_file_range.
It's surely going to eat errors:

	rc = sync_file_range(fd, offset, nbytes,
						 SYNC_FILE_RANGE_WRITE);

	/* don't error out, this is just a performance optimization */
	if (rc != 0)
	{
		ereport(WARNING,
				(errcode_for_file_access(),
				 errmsg("could not flush dirty data: %m")));
	}

It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which
seems sensible, because they could be considered data integrity
operations.

fs/sync.c:
SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
		unsigned int, flags)
{
	...

	if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
		ret = file_fdatawait_range(f.file, offset, endbyte);
		if (ret < 0)
			goto out_put;
	}

	if (flags & SYNC_FILE_RANGE_WRITE) {
		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
						 WB_SYNC_NONE);
		if (ret < 0)
			goto out_put;
	}

	if (flags & SYNC_FILE_RANGE_WAIT_AFTER)
		ret = file_fdatawait_range(f.file, offset, endbyte);

where
int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
{
	struct address_space *mapping = file->f_mapping;

	__filemap_fdatawait_range(mapping, start_byte, end_byte);
	return file_check_and_advance_wb_err(file);
}
EXPORT_SYMBOL(file_fdatawait_range);

the critical call is file_check_and_advance_wb_err(). That's the call
that checks and clears errors.

Would be good to add a kernel xfstest (gets used on all relevant FS
these days), to make sure that doesn't change.

I'm very suspicious about the safety of the msync() path too.

That seems justified however:

SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
{
	...
	error = vfs_fsync_range(file, fstart, fend, 1);
	...
}

int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
	...
	return file->f_op->fsync(file, start, end, datasync);
}
EXPORT_SYMBOL(vfs_fsync_range);

int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
	...
	ret = file_write_and_wait_range(file, start, end);
	if (ret)
		return ret;
	...

STATIC int
xfs_file_fsync(
	struct file		*file,
	loff_t			start,
	loff_t			end,
	int			datasync)
{
	...
	error = file_write_and_wait_range(file, start, end);
	if (error)
		return error;
	...

int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
{
	...
	err2 = file_check_and_advance_wb_err(file);
	if (!err)
		err = err2;
	return err;
}

Greetings,

Andres Freund

#27Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#26)
Re: Postgres, fsync, and OSs (specifically linux)

On 1 May 2018 at 00:09, Andres Freund <andres@anarazel.de> wrote:

It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which
seems sensible, because they could be considered data integrity
operations.

Ah, I misread that. Thank you.

I'm very suspicious about the safety of the msync() path too.

That seems justified however:

I'll add EIO tests there.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#28Catalin Iacob
iacobcatalin@gmail.com
In reply to: Andres Freund (#1)
Re: Postgres, fsync, and OSs (specifically linux)

On Sat, Apr 28, 2018 at 12:28 AM, Andres Freund <andres@anarazel.de> wrote:

Before linux v4.13 errors in kernel writeback would be reported at most
once, without a guarantee that that'd happen (IIUC memory pressure could
lead to the relevant information being evicted) - but it was pretty
likely. After v4.13 (see https://lwn.net/Articles/724307/) errors are
reported exactly once to all open file descriptors for a file with an
error - but never for files that have been opened after the error
occurred.

snip

== Proposed Linux Changes ==

- Matthew Wilcox proposed (and posted a patch) that'd partially revert
behaviour to the pre v4.13 world, by *also* reporting errors to
"newer" file-descriptors if the error hasn't previously been
reported. That'd still not guarantee that the error is reported
(memory pressure could evict information without open fd), but in most
situations we'll again get the error in the checkpointer.

This seems largely be agreed upon. It's unclear whether it'll go into
the stable backports for still-maintained >= v4.13 kernels.

This is now merged; if it's not reverted, it will appear in v4.17.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3

The commit is cc-ed to stable so it should get picked up in the near future.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4678df184b314a2bd47d2329feca2c2534aa12b

#29Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: Postgres, fsync, and OSs (specifically linux)

Hi,

On 2018-04-27 15:28:42 -0700, Andres Freund wrote:

I went to LSF/MM 2018 to discuss [0] and related issues. Overall I'd say
it was a very productive discussion. I'll first try to recap the
current situation, updated with knowledge I gained. Secondly I'll try to
discuss the kernel changes that seem to have been agreed upon. Thirdly
I'll try to sum up what postgres needs to change.

LWN summarized the discussion as well:

https://lwn.net/SubscriberLink/752952/6825e6a1ddcfb1f3/

- Andres

#30Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#27)
Re: Postgres, fsync, and OSs (specifically linux)

On 2018-05-01 09:38:03 +0800, Craig Ringer wrote:

On 1 May 2018 at 00:09, Andres Freund <andres@anarazel.de> wrote:

It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which
seems sensible, because they could be considered data integrity
operations.

Ah, I misread that. Thank you.

I'm very suspicious about the safety of the msync() path too.

That seems justified however:

I'll add EIO tests there.

Do you have a patchset including that? I didn't find anything after a
quick search...

Greetings,

Andres Freund

#31Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#30)
1 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

On 10 May 2018 at 06:55, Andres Freund <andres@anarazel.de> wrote:

Do you have a patchset including that? I didn't find anything after a
quick search...

There was an earlier rev on the other thread but without msync checks.

I've added panic for msync in the attached, and tidied the comments a bit.

I didn't add comments on why we panic to each individual pg_fsync or
FileSync caller that panics; instead pg_fsync points to
pg_fsync_no_writethrough which explains it briefly and has links.

I looked at callers of pg_fsync, pg_fsync_no_writethrough,
pg_fsync_writethrough, mdsync, and FileSync when writing this.

WAL writing already PANIC'd on fsync failure, which helps, though we
now know that's not sufficient for complete safety.

Patch on top of v11 HEAD @ ddc1f32ee507

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v2-0001-PANIC-when-we-detect-a-possible-fsync-I-O-error-i.patchtext/x-patch; charset=US-ASCII; name=v2-0001-PANIC-when-we-detect-a-possible-fsync-I-O-error-i.patchDownload
From a6cade9e1de68962d95374127841b0af8eb4cfe0 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 10 Apr 2018 14:08:32 +0800
Subject: [PATCH v2] PANIC when we detect a possible fsync I/O error instead of
 retrying fsync

Panic the server on fsync failure in places where we can't simply repeat the
whole operation on retry. Most importantly, panic when fsync fails during a
checkpoint.

This will result in log messages like:

    PANIC:  58030: could not fsync file "base/12367/16386": Input/output error
    LOG:  00000: checkpointer process (PID 10799) was terminated by signal 6: Aborted

and, if the condition persists during redo:

    LOG:  00000: checkpoint starting: end-of-recovery immediate
    PANIC:  58030: could not fsync file "base/12367/16386": Input/output error
    LOG:  00000: startup process (PID 10808) was terminated by signal 6: Aborted

Why?

In a number of places PostgreSQL responded to fsync() errors by retrying the
fsync(), expecting that this would force the operating system to repeat any
write attempts. The code assumed that fsync() would return an error on all
subsequent calls until any I/O error was resolved.

This is not what actually happens on some platforms, including Linux. The
operating system may give up and drop dirty buffers for async writes on the
floor and mark the page mapping as bad. The first fsync() clears any error flag
from the page entry and/or our file descriptor. So a subsequent fsync() returns
success, even though the data PostgreSQL wrote was really discarded.

We have no way to find out which writes failed, and no way to ask the kernel to
retry indefinitely, so all we can do is PANIC. Redo will attempt the write
again, and if it fails again, it will also PANIC.

This doesn't completely prevent fsync reliability issues, because it only
handles cases where the kernel actually reports the error to us. It's entirely
possible for a buffered write to be lost without causing fsync to report an
error at all (see discussion below). Work on addressing those issues and
documenting them is ongoing and will be committed separately.

See:

* https://www.postgresql.org/message-id/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
* https://www.postgresql.org/message-id/20180427222842.in2e4mibx45zdth5@alap3.anarazel.de
* https://lwn.net/Articles/752063/
* https://lwn.net/Articles/753650/
* https://lwn.net/Articles/752952/
* https://lwn.net/Articles/752613/
---
 src/backend/access/heap/rewriteheap.c       |  6 +++---
 src/backend/access/transam/timeline.c       |  4 ++--
 src/backend/access/transam/twophase.c       |  2 +-
 src/backend/access/transam/xlog.c           |  4 ++--
 src/backend/replication/logical/snapbuild.c |  3 +++
 src/backend/storage/file/fd.c               | 29 +++++++++++++++++++++++++++--
 src/backend/storage/smgr/md.c               | 22 ++++++++++++++++------
 src/backend/utils/cache/relmapper.c         |  2 +-
 8 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 8d3c861a33..0320baffec 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -965,7 +965,7 @@ logical_end_heap_rewrite(RewriteState state)
 	while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
 	{
 		if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
-			ereport(ERROR,
+			ereport(PANIC,
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m", src->path)));
 		FileClose(src->vfd);
@@ -1180,7 +1180,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", path)));
 	pgstat_report_wait_end();
@@ -1279,7 +1279,7 @@ CheckPointLogicalRewriteHeap(void)
 			 */
 			pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC);
 			if (pg_fsync(fd) != 0)
-				ereport(ERROR,
+				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not fsync file \"%s\": %m", path)));
 			pgstat_report_wait_end();
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 61d36050c3..f4b8410333 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -406,7 +406,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 
 	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
@@ -485,7 +485,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 
 	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 65194db70e..962412c0f4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1687,7 +1687,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	if (pg_fsync(fd) != 0)
 	{
 		CloseTransientFile(fd);
-		ereport(ERROR,
+		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync two-phase state file: %m")));
 	}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c633e11128..f3cf4c9a65 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3269,7 +3269,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	if (pg_fsync(fd) != 0)
 	{
 		close(fd);
-		ereport(ERROR,
+		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	}
@@ -3435,7 +3435,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4123cdebcf..31ab7c1de9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1616,6 +1616,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	 * fsync the file before renaming so that even if we crash after this we
 	 * have either a fully valid file or nothing.
 	 *
+	 * It's safe to just ERROR on fsync() here because we'll retry the whole
+	 * operation including the writes.
+	 *
 	 * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
 	 * some noticeable overhead since it's performed synchronously during
 	 * decoding?
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 441f18dcf5..95f32484d2 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -353,6 +353,17 @@ pg_fsync(int fd)
 /*
  * pg_fsync_no_writethrough --- same as fsync except does nothing if
  *	enableFsync is off
+ *
+ * WARNING: It is unsafe to retry fsync() calls without repeating the preceding
+ * writes.  fsync() clears the error flag on some platforms (including Linux,
+ * true up to at least 4.14) when it reports the error to the caller. A second
+ * call may return success even though writes are lost. Many callers test the
+ * return value and PANIC on failure so that redo repeats the writes. It is
+ * safe to ERROR instead if the whole operation can be retried without needing
+ * WAL redo.
+ *
+ * See https://lwn.net/Articles/752063/
+ * and https://www.postgresql.org/message-id/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
  */
 int
 pg_fsync_no_writethrough(int fd)
@@ -443,7 +454,12 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 		rc = sync_file_range(fd, offset, nbytes,
 							 SYNC_FILE_RANGE_WRITE);
 
-		/* don't error out, this is just a performance optimization */
+		/*
+		 * Don't error out, this is just a performance optimization.
+		 *
+		 * sync_file_range(SYNC_FILE_RANGE_WRITE) won't clear any error flags,
+		 * so we don't have to worry about this impacting fsync reliability.
+		 */
 		if (rc != 0)
 		{
 			ereport(WARNING,
@@ -518,7 +534,12 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 			rc = msync(p, (size_t) nbytes, MS_ASYNC);
 			if (rc != 0)
 			{
-				ereport(WARNING,
+				/*
+				 * We must panic here to preserve fsync reliability,
+				 * as msync may clear the fsync error state on some
+				 * OSes. See pg_fsync_no_writethrough().
+				 */
+				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not flush dirty data: %m")));
 				/* NB: need to fall through to munmap()! */
@@ -3250,6 +3271,10 @@ looks_like_temp_rel_name(const char *name)
  * harmless cases such as read-only files in the data directory, and that's
  * not good either.
  *
+ * Importantly, on Linux (true in 4.14) and some other platforms, fsync errors
+ * will consume the error, causing a subsequent fsync to succeed even though
+ * the writes did not succeed. See pg_fsync_no_writethrough().
+ *
  * Note we assume we're chdir'd into PGDATA to begin with.
  */
 void
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ec103e604..614fa4f3ec 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1038,7 +1038,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
 
 		if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
-			ereport(ERROR,
+			ereport(PANIC,
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(v->mdfd_vfd))));
@@ -1265,13 +1265,22 @@ mdsync(void)
 					 * _mdfd_getseg() and for FileSync, since fd.c might have
 					 * closed the file behind our back.
 					 *
-					 * XXX is there any point in allowing more than one retry?
-					 * Don't see one at the moment, but easy to change the
-					 * test here if so.
+					 * It's unsafe to ignore failures for other errors,
+					 * particularly EIO or (undocumented, but possible) ENOSPC.
+					 * The first fsync() will clear any error flag on dirty
+					 * buffers pending writeback and/or the file descriptor, so
+					 * a second fsync reports success despite the buffers
+					 * possibly not being written. (Verified on Linux 4.14).
+					 * To cope with this we must PANIC and redo all writes
+					 * since the last successful checkpoint. See discussion at:
+					 *
+					 * https://www.postgresql.org/message-id/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
+					 *
+					 * for details.
 					 */
 					if (!FILE_POSSIBLY_DELETED(errno) ||
 						failures > 0)
-						ereport(ERROR,
+						ereport(PANIC,
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\": %m",
 										path)));
@@ -1280,6 +1289,7 @@ mdsync(void)
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\" but retrying: %m",
 										path)));
+
 					pfree(path);
 
 					/*
@@ -1444,7 +1454,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 				(errmsg("could not forward fsync request because request queue is full")));
 
 		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(ERROR,
+			ereport(PANIC,
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(seg->mdfd_vfd))));
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 99d095f2df..f8ff793a66 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -795,7 +795,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync relation mapping file \"%s\": %m",
 						mapfilename)));
-- 
2.14.3

#32Andres Freund
andres@anarazel.de
In reply to: Craig Ringer (#31)
Re: Postgres, fsync, and OSs (specifically linux)

Hi,

On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:

while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
{
if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
- ereport(ERROR,
+ ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m", src->path)));

To me this (and the other callers) doesn't quite look right. First, I
think we should probably be a bit more restrictive about when PANIC
out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
others. Secondly, I think we should centralize the error handling. It
seems likely that we'll accrue some platform-specific workarounds, and I
don't want to copy that knowledge everywhere.

Also, don't we need the same on close()?

- Andres

#33Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#32)
Re: Postgres, fsync, and OSs (specifically linux)

On Thu, May 17, 2018 at 12:44 PM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:

while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
{
if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
- ereport(ERROR,
+ ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m", src->path)));

To me this (and the other callers) doesn't quite look right. First, I
think we should probably be a bit more restrictive about when to PANIC:
it seems like we should PANIC on ENOSPC and EIO, but possibly not on
others. Secondly, I think we should centralize the error handling. It
seems likely that we'll accrue some platform-specific workarounds, and I
don't want to copy that knowledge everywhere.

Maybe something like:

ereport(promote_eio_to_panic(ERROR), ...)

?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#34Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#33)
Re: Postgres, fsync, and OSs (specifically linux)

On Thu, May 17, 2018 at 11:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, May 17, 2018 at 12:44 PM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:

while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
{
if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
- ereport(ERROR,
+ ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m", src->path)));

To me this (and the other callers) doesn't quite look right. First, I
think we should probably be a bit more restrictive about when to PANIC:
it seems like we should PANIC on ENOSPC and EIO, but possibly not on
others. Secondly, I think we should centralize the error handling. It
seems likely that we'll accrue some platform-specific workarounds, and I
don't want to copy that knowledge everywhere.

Maybe something like:

ereport(promote_eio_to_panic(ERROR), ...)

Well, searching for places where an error is reported at level PANIC
using the word PANIC would miss these instances. People would have to
remember to search for promote_eio_to_panic. Maybe handle the errors
inside FileSync() itself, or in a wrapper around it.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#35Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
7 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

Hi,

On 2018-04-27 15:28:42 -0700, Andres Freund wrote:

== Potential Postgres Changes ==

Several operating systems / file systems behave differently (See
e.g. [2], thanks Thomas) than we expected. Even the discussed changes to
e.g. linux don't get to where we thought we are. There's obviously also
the question of how to deal with kernels / OSs that have not been
updated.

Changes that appear to be necessary, even for kernels with the issues
addressed:

- Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
retry recovery. While ENODEV (underlying device went away) will be
persistent, it probably makes sense to treat it the same or even just
give up and shut down. One question I see here is whether we just
want to continue crash-recovery cycles, or whether we want to limit
that.

Craig has a patch for this, although I'm not yet 100% happy with it.

- We need more aggressive error checking on close(), for ENOSPC and
EIO. In both cases afaics we'll have to trigger a crash recovery
cycle. It's entirely possible to end up in a loop on NFS etc, but I
don't think there's a way around that.

This needs to be handled.

- The outstanding fsync request queue isn't persisted properly [3]. This
means that even if the kernel behaved the way we'd expected, we'd not
fail a second checkpoint :(. It's possible that we don't need to deal
with this because we'll henceforth PANIC, but I'd argue we should fix
that regardless. Seems like a time-bomb otherwise (e.g. after moving
to DIO somebody might want to relax the PANIC...).

What we could do:

- forward file descriptors from backends to checkpointer (using
SCM_RIGHTS) when marking a segment dirty. That'd require some
optimizations (see [4]) to avoid doing so repeatedly. That'd
guarantee correct behaviour in all linux kernels >= 4.13 (possibly
backported by distributions?), and I think it'd also make it vastly
more likely that errors are reported in earlier kernels.

This should be doable without a noticeable performance impact, I
believe. I don't think it'd be that hard either, but it'd be a bit of
a pain to backport it to all postgres versions, as well as a bit
invasive for that.

The infrastructure this'd likely end up building (hashtable of open
relfilenodes), would likely be useful for further things (like caching
file size).

I've written a patch series for this. Took me quite a bit longer than I
had hoped.

The attached patch series consists of a few preparatory patches:
- freespace optimization to not call smgrexists() unnecessarily
- register_dirty_segment() optimization to not queue requests for
segments that locally are known to already have been dirtied. This
seems like a good optimization regardless of further changes. Doesn't
yet deal with the mdsync counter wrapping around (which is unlikely to
ever happen in practice, it's 32bit).
- some fd.c changes, I don't think they're quite right yet
- new functions to send/recv data over a unix domain socket, *including*
a file descriptor.

The main patch guarantees that fsync requests are forwarded from
backends to the checkpointer, including the file descriptor. As we do so
immediately at mdwrite() time, that guarantees that the fd has been open
since before the write started, so linux will guarantee that that FD
will see errors.

The design of the patch went through a few iterations. I initially
attempted to make the fsync request hashtable shared, but that turned
out to be a lot harder to do reliably *and* fast than I was anticipating
(we'd need to hold a lock for the entirety of mdsync(), and dynahash
doesn't allow iteration while other backends modify the table).

So what I instead did was to replace the shared memory fsync request
queue with a unix domain socket (created in postmaster, using
socketpair()). CheckpointerRequest structs are written to that queue,
including the associated file descriptor. The checkpointer absorbs
those requests and updates the pending-requests hashtable in local
process memory. To ensure that mdsync() has read all requests from the
last cycle, the checkpointer self-enqueues a token, which allows it to
detect the end of the relevant portion of the queue.

The biggest complication in this scheme is that the checkpointer now
needs to keep a file descriptor open for every segment. That obviously
requires adding a few new fields to the hashtable entry. But the bigger
issue is that it's now possible that pending requests need to be
processed earlier than the next checkpoint, because of file descriptor
limits. To address that, absorbing the fsync request queue will now do
an mdsync()-style pass, performing the necessary fsyncs.

Because mdsync() (or rather its new workhorse mdsyncpass()) will now not
open files itself, there's no need to deal with retries for files that
have been deleted. For the cases where we haven't yet received an fsync
cancel request, we'll just fsync the fd. That's unnecessary, but
harmless.

Obviously this is currently heavily unix-specific (according to my
research, all our unix platforms say that they support sending fds
across unix domain sockets w/ SCM_RIGHTS). It's unclear whether any
OS but linux benefits from not closing file descriptors before fsync().

We could make this work for windows, without *too* much trouble (one can
just open fds in another process, using that process' handle).

I think there's some advantage in using the same approach
everywhere. For one, we'd avoid maintaining two radically different
approaches for complicated code. It'd also allow us to offload more
fsyncs to the checkpointer, not just the ones for normal relation files,
which does seem advantageous. Not having ugly retry logic around deleted
files in mdsync() also seems nice. But there are cases where this is
likely slower, due to the potential of having to wait for the
checkpointer when the queue is full.

I'll note that I think the new mdsync() is considerably simpler. Even if
we do not adopt the approach presented here, I think we should make some
of those changes. Specifically, not unlinking the pending requests
bitmap in mdsync() seems like it both resolves an existing bug (see
upthread) and makes the code simpler.

I plan to switch to working on something else for a day or two next
week, and then polish this further. I'd greatly appreciate comments till
then.

I didn't want to do this now, but I think we should also consider
removing all awareness of segments from the fsync request queue. Instead
it should deal with individual files, and the segmentation should be
handled by md.c. That'd allow us to move all the necessary code to
smgr.c (or checkpointer?); Thomas said that'd be helpful for further
work. I personally think it'd be a lot simpler, because having to keep
long bitmaps with only the last bit set for large append-only relations
isn't a particularly sensible approach imo. The only thing that'd become
more complicated is that the file/database unlink requests get more
expensive (as they'd likely need to search the whole table), but that
seems like a sensible tradeoff. A tree structure would obviously be an
alternative. Personally I was thinking that we should just make the
hashtable be keyed by pathname; that seems most generic.

Greetings,

Andres Freund

Attachments:

v1-0002-Add-functions-to-send-receive-data-FD-over-a-unix.patchtext/x-diff; charset=us-asciiDownload
From 8c16dcb5f341651e5ae19ea9d5f935b23f52d902 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 May 2018 12:38:25 -0700
Subject: [PATCH v1 2/7] Add functions to send/receive data & FD over a unix
 domain socket.

This'll be used by a followup patch changing how the fsync request
queue works, to make it safe on linux.

TODO: This probably should live elsewhere.

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/file/fd.c | 102 ++++++++++++++++++++++++++++++++++
 src/include/storage/fd.h      |   4 ++
 2 files changed, 106 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 441f18dcf56..65e46483a44 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3572,3 +3572,105 @@ MakePGDirectory(const char *directoryName)
 {
 	return mkdir(directoryName, pg_dir_create_mode);
 }
+
+/*
+ * Send data over a unix domain socket, optionally (when fd != -1) including a
+ * file descriptor.
+ */
+ssize_t
+pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd)
+{
+	ssize_t     size;
+	struct msghdr   msg = {0};
+	struct iovec    iov;
+	/* cmsg header, union for correct alignment */
+	union
+	{
+		struct cmsghdr  cmsghdr;
+		char        control[CMSG_SPACE(sizeof (int))];
+	} cmsgu;
+	struct cmsghdr  *cmsg;
+
+	iov.iov_base = buf;
+	iov.iov_len = buflen;
+
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
+	if (fd >= 0)
+	{
+		msg.msg_control = cmsgu.control;
+		msg.msg_controllen = sizeof(cmsgu.control);
+
+		cmsg = CMSG_FIRSTHDR(&msg);
+		cmsg->cmsg_len = CMSG_LEN(sizeof (int));
+		cmsg->cmsg_level = SOL_SOCKET;
+		cmsg->cmsg_type = SCM_RIGHTS;
+
+		*((int *) CMSG_DATA(cmsg)) = fd;
+	}
+
+	size = sendmsg(sock, &msg, 0);
+
+	/* errors are returned directly */
+	return size;
+}
+
+/*
+ * Receive data from a unix domain socket. If a file is sent over the socket,
+ * store it in *fd.
+ */
+ssize_t
+pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd)
+{
+	ssize_t     size;
+	struct msghdr   msg;
+	struct iovec    iov;
+	/* cmsg header, union for correct alignment */
+	union
+	{
+		struct cmsghdr  cmsghdr;
+		char        control[CMSG_SPACE(sizeof (int))];
+	} cmsgu;
+	struct cmsghdr  *cmsg;
+
+	Assert(fd != NULL);
+
+	iov.iov_base = buf;
+	iov.iov_len = bufsize;
+
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = cmsgu.control;
+	msg.msg_controllen = sizeof(cmsgu.control);
+
+	size = recvmsg (sock, &msg, 0);
+
+	if (size < 0)
+	{
+		*fd = -1;
+		return size;
+	}
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	if (cmsg && cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
+	{
+		if (cmsg->cmsg_level != SOL_SOCKET)
+			elog(FATAL, "unexpected cmsg_level");
+
+		if (cmsg->cmsg_type != SCM_RIGHTS)
+			elog(FATAL, "unexpected cmsg_type");
+
+		*fd = *((int *) CMSG_DATA(cmsg));
+
+		/* FIXME: check / handle additional cmsg structures */
+	}
+	else
+		*fd = -1;
+
+	return size;
+}
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..5e016d69a5a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -143,4 +143,8 @@ extern void SyncDataDirectory(void);
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
 #define PG_TEMP_FILE_PREFIX "pgsql_tmp"
 
+/* XXX; This should probably go elsewhere */
+ssize_t pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd);
+ssize_t pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd);
+
 #endif							/* FD_H */
-- 
2.17.0.rc1.dirty

v1-0001-freespace-Don-t-constantly-close-files-when-readi.patchtext/x-diff; charset=us-asciiDownload
From ecb3bce411622780bb27bb0c17eb0af2e6a0a3b3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 May 2018 12:33:12 -0700
Subject: [PATCH v1 1/7] freespace: Don't constantly close files when reading
 buffer.

fsm_readbuf() used to always do an smgrexists() when reading a buffer
beyond the known file size. That currently implies closing the md.c
handle, losing all the data cached therein.  Change this to only
check for file existence when not already known to be larger than 0
blocks.

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/freespace/freespace.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 65c4e74999f..d7569cec5ed 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -556,7 +556,7 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
 	 * not on extension.)
 	 */
 	if (rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber ||
-		blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+		rel->rd_smgr->smgr_fsm_nblocks == 0)
 	{
 		if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
 			rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
@@ -564,6 +564,9 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
 		else
 			rel->rd_smgr->smgr_fsm_nblocks = 0;
 	}
+	else if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+		rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
+													 FSM_FORKNUM);
 
 	/* Handle requests beyond EOF */
 	if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
-- 
2.17.0.rc1.dirty

v1-0003-Make-FileGetRawDesc-ensure-there-s-an-associated-.patchtext/x-diff; charset=us-asciiDownload
From 298bcf50f8cac7b8955f9f25b997cc4c4b65fbd0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 May 2018 12:40:05 -0700
Subject: [PATCH v1 3/7] Make FileGetRawDesc() ensure there's an associated
 kernel FD.

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/file/fd.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 65e46483a44..8ae13a51ec1 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2232,6 +2232,10 @@ int
 FileGetRawDesc(File file)
 {
 	Assert(FileIsValid(file));
+
+	if (FileAccess(file))
+		return -1;
+
 	return VfdCache[file].fd;
 }
 
-- 
2.17.0.rc1.dirty

v1-0004-WIP-Allow-to-create-a-transient-file-for-a-previo.patchtext/x-diff; charset=us-asciiDownload
From 786fafb408ba0b9080941170eedcb6a58c2df8d1 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 May 2018 12:42:32 -0700
Subject: [PATCH v1 4/7] WIP: Allow to create a transient file for a previously
 opened FD.

It might be better to extend the normal vfd files instead, adding a
flag that prohibits closing the underlying file (and removing them
from the LRU).

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/file/fd.c | 32 ++++++++++++++++++++++++++++++++
 src/include/storage/fd.h      |  2 ++
 2 files changed, 34 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8ae13a51ec1..e2492ce94d5 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2430,6 +2430,38 @@ OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 	return -1;					/* failure */
 }
 
+void
+ReserveTransientFile(void)
+{
+	if (!reserveAllocatedDesc())
+		ereport(PANIC,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("exceeded maxAllocatedDescs (%d) while trying to open file",
+						maxAllocatedDescs)));
+
+	/* Close excess kernel FDs. */
+	ReleaseLruFiles();
+
+	Assert(nfile + numAllocatedDescs <= max_safe_fds);
+}
+
+void
+RegisterTransientFile(int fd)
+{
+	AllocateDesc *desc;
+
+	/* make sure ReserveTransientFile was called sufficiently recently */
+	Assert(fd >= 0);
+	Assert(nfile + numAllocatedDescs <= max_safe_fds);
+	Assert(numAllocatedDescs < maxAllocatedDescs);
+
+	desc = &allocatedDescs[numAllocatedDescs];
+	desc->kind = AllocateDescRawFD;
+	desc->desc.fd = fd;
+	desc->create_subid = GetCurrentSubTransactionId();
+	numAllocatedDescs++;
+}
+
 /*
  * Routines that want to initiate a pipe stream should use OpenPipeStream
  * rather than plain popen().  This lets fd.c deal with freeing FDs if
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 5e016d69a5a..9bb32771602 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -105,6 +105,8 @@ extern int	FreeDir(DIR *dir);
 /* Operations to allow use of a plain kernel FD, with automatic cleanup */
 extern int	OpenTransientFile(const char *fileName, int fileFlags);
 extern int	OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
+extern void ReserveTransientFile(void);
+extern void RegisterTransientFile(int fd);
 extern int	CloseTransientFile(int fd);
 
 /* If you've really really gotta have a plain kernel FD, use this */
-- 
2.17.0.rc1.dirty

v1-0005-WIP-Allow-more-transient-files-and-allow-to-query.patchtext/x-diff; charset=us-asciiDownload
From 53ded3d52f792c58ad3550ffb2244cec19981b66 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 May 2018 12:43:40 -0700
Subject: [PATCH v1 5/7] WIP: Allow more transient files and allow to query the
 max.

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/file/fd.c | 10 ++++++++--
 src/include/storage/fd.h      |  1 +
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e2492ce94d5..b0db997edc7 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2299,10 +2299,10 @@ reserveAllocatedDesc(void)
 	 *
 	 * We mustn't let allocated descriptors hog all the available FDs, and in
 	 * practice we'd better leave a reasonable number of FDs for VFD use.  So
-	 * set the maximum to max_safe_fds / 2.  (This should certainly be at
+	 * set the maximum to 80% of max_safe_fds.  (This should certainly be at
 	 * least as large as the initial size, FD_MINFREE / 2.)
 	 */
-	newMax = max_safe_fds / 2;
+	newMax = MaxTransientFiles(); // XXX: more accurate name
 	if (newMax > maxAllocatedDescs)
 	{
 		newDescs = (AllocateDesc *) realloc(allocatedDescs,
@@ -2610,6 +2610,12 @@ CloseTransientFile(int fd)
 	return close(fd);
 }
 
+int
+MaxTransientFiles(void)
+{
+	return (max_safe_fds * 8) / 10;
+}
+
 /*
  * Routines that want to use <dirent.h> (ie, DIR*) should use AllocateDir
  * rather than plain opendir().  This lets fd.c deal with freeing FDs if
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 9bb32771602..2c3055e77cd 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -108,6 +108,7 @@ extern int	OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fil
 extern void ReserveTransientFile(void);
 extern void RegisterTransientFile(int fd);
 extern int	CloseTransientFile(int fd);
+extern int	MaxTransientFiles(void);
 
 /* If you've really really gotta have a plain kernel FD, use this */
 extern int	BasicOpenFile(const char *fileName, int fileFlags);
-- 
2.17.0.rc1.dirty

v1-0006-WIP-Optimize-register_dirty_segment-to-not-repeat.patchtext/x-diff; charset=us-asciiDownload
From 36df35480f033ddc02d463959ed6e764afcf63ad Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 May 2018 12:47:33 -0700
Subject: [PATCH v1 6/7] WIP: Optimize register_dirty_segment() to not
 repeatedly queue fsync requests.

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/checkpointer.c | 36 ++++++++++++-------
 src/backend/storage/smgr/md.c         | 50 +++++++++++++++++++--------
 src/include/postmaster/bgwriter.h     |  3 ++
 3 files changed, 63 insertions(+), 26 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0950ada6019..333eb91c9de 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,6 +46,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "port/atomics.h"
 #include "postmaster/bgwriter.h"
 #include "replication/syncrep.h"
 #include "storage/bufmgr.h"
@@ -126,8 +127,9 @@ typedef struct
 
 	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
 
-	uint32		num_backend_writes; /* counts user backend buffer writes */
-	uint32		num_backend_fsync;	/* counts user backend fsync calls */
+	pg_atomic_uint32 num_backend_writes; /* counts user backend buffer writes */
+	pg_atomic_uint32 num_backend_fsync;	/* counts user backend fsync calls */
+	pg_atomic_uint32 ckpt_cycle; /* cycle */
 
 	int			num_requests;	/* current # of requests */
 	int			max_requests;	/* allocated array size */
@@ -943,6 +945,9 @@ CheckpointerShmemInit(void)
 		MemSet(CheckpointerShmem, 0, size);
 		SpinLockInit(&CheckpointerShmem->ckpt_lck);
 		CheckpointerShmem->max_requests = NBuffers;
+		pg_atomic_init_u32(&CheckpointerShmem->ckpt_cycle, 0);
+		pg_atomic_init_u32(&CheckpointerShmem->num_backend_writes, 0);
+		pg_atomic_init_u32(&CheckpointerShmem->num_backend_fsync, 0);
 	}
 }
 
@@ -1133,10 +1138,6 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 
 	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
-	/* Count all backend writes regardless of if they fit in the queue */
-	if (!AmBackgroundWriterProcess())
-		CheckpointerShmem->num_backend_writes++;
-
 	/*
 	 * If the checkpointer isn't running or the request queue is full, the
 	 * backend will have to perform its own fsync request.  But before forcing
@@ -1151,7 +1152,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		 * fsync
 		 */
 		if (!AmBackgroundWriterProcess())
-			CheckpointerShmem->num_backend_fsync++;
+			pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_fsync, 1);
 		LWLockRelease(CheckpointerCommLock);
 		return false;
 	}
@@ -1312,11 +1313,10 @@ AbsorbFsyncRequests(void)
 	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
 	/* Transfer stats counts into pending pgstats message */
-	BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-	BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
-
-	CheckpointerShmem->num_backend_writes = 0;
-	CheckpointerShmem->num_backend_fsync = 0;
+	BgWriterStats.m_buf_written_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+	BgWriterStats.m_buf_fsync_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
 
 	/*
 	 * We try to avoid holding the lock for a long time by copying the request
@@ -1390,3 +1390,15 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+uint32
+GetCheckpointSyncCycle(void)
+{
+	return pg_atomic_read_u32(&CheckpointerShmem->ckpt_cycle);
+}
+
+uint32
+IncCheckpointSyncCycle(void)
+{
+	return pg_atomic_fetch_add_u32(&CheckpointerShmem->ckpt_cycle, 1);
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ec103e6047..555774320b5 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -109,6 +109,7 @@ typedef struct _MdfdVec
 {
 	File		mdfd_vfd;		/* fd number in fd.c's pool */
 	BlockNumber mdfd_segno;		/* segment number, from 0 */
+	uint32		mdfd_dirtied_cycle;
 } MdfdVec;
 
 static MemoryContext MdCxt;		/* context for all MdfdVec objects */
@@ -133,12 +134,12 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
  * (Regular backends do not track pending operations locally, but forward
  * them to the checkpointer.)
  */
-typedef uint16 CycleCtr;		/* can be any convenient integer size */
+typedef uint32 CycleCtr;		/* can be any convenient integer size */
 
 typedef struct
 {
 	RelFileNode rnode;			/* hash table key (must be first!) */
-	CycleCtr	cycle_ctr;		/* mdsync_cycle_ctr of oldest request */
+	CycleCtr	cycle_ctr;		/* sync cycle of oldest request */
 	/* requests[f] has bit n set if we need to fsync segment n of fork f */
 	Bitmapset  *requests[MAX_FORKNUM + 1];
 	/* canceled[f] is true if we canceled fsyncs for fork "recently" */
@@ -155,7 +156,6 @@ static HTAB *pendingOpsTable = NULL;
 static List *pendingUnlinks = NIL;
 static MemoryContext pendingOpsCxt; /* context for the above  */
 
-static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
 
@@ -333,6 +333,7 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
 	mdfd = &reln->md_seg_fds[forkNum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
+	mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
 }
 
 /*
@@ -614,6 +615,7 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
 	mdfd = &reln->md_seg_fds[forknum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
+	mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
 
 	Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
 
@@ -1089,9 +1091,9 @@ mdsync(void)
 	 * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
 	 * checkpoint), we want to ignore fsync requests that are entered into the
 	 * hashtable after this point --- they should be processed next time,
-	 * instead.  We use mdsync_cycle_ctr to tell old entries apart from new
-	 * ones: new ones will have cycle_ctr equal to the incremented value of
-	 * mdsync_cycle_ctr.
+	 * instead.  We use GetCheckpointSyncCycle() to tell old entries apart
+	 * from new ones: new ones will have cycle_ctr equal to
+	 * IncCheckpointSyncCycle().
 	 *
 	 * In normal circumstances, all entries present in the table at this point
 	 * will have cycle_ctr exactly equal to the current (about to be old)
@@ -1115,16 +1117,16 @@ mdsync(void)
 		hash_seq_init(&hstat, pendingOpsTable);
 		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 		{
-			entry->cycle_ctr = mdsync_cycle_ctr;
+			entry->cycle_ctr = GetCheckpointSyncCycle();
 		}
 	}
 
-	/* Advance counter so that new hashtable entries are distinguishable */
-	mdsync_cycle_ctr++;
-
 	/* Set flag to detect failure if we don't reach the end of the loop */
 	mdsync_in_progress = true;
 
+	/* Advance counter so that new hashtable entries are distinguishable */
+	IncCheckpointSyncCycle();
+
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
@@ -1137,11 +1139,11 @@ mdsync(void)
 		 * contain multiple fsync-request bits, but they are all new.  Note
 		 * "continue" bypasses the hash-remove call at the bottom of the loop.
 		 */
-		if (entry->cycle_ctr == mdsync_cycle_ctr)
+		if (entry->cycle_ctr == GetCheckpointSyncCycle())
 			continue;
 
 		/* Else assert we haven't missed it */
-		Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
+		Assert((CycleCtr) (entry->cycle_ctr + 1) == GetCheckpointSyncCycle());
 
 		/*
 		 * Scan over the forks and segments represented by the entry.
@@ -1308,7 +1310,7 @@ mdsync(void)
 				break;
 		}
 		if (forknum <= MAX_FORKNUM)
-			entry->cycle_ctr = mdsync_cycle_ctr;
+			entry->cycle_ctr = GetCheckpointSyncCycle();
 		else
 		{
 			/* Okay to remove it */
@@ -1427,18 +1429,37 @@ mdpostckpt(void)
 static void
 register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
+	uint32 cycle;
+
 	/* Temp relations should never be fsync'd */
 	Assert(!SmgrIsTemp(reln));
 
+	pg_memory_barrier();
+	cycle = GetCheckpointSyncCycle();
+
+	/*
+	 * Don't repeatedly register the same segment as dirty.
+	 *
+	 * FIXME: This doesn't correctly deal with overflows yet! We could
+	 * e.g. emit an smgr invalidation every now and then, or use a 64bit
+	 * counter.  Or just error out if the cycle reaches UINT32_MAX.
+	 */
+	if (seg->mdfd_dirtied_cycle == cycle)
+		return;
+
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
 		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+		seg->mdfd_dirtied_cycle = cycle;
 	}
 	else
 	{
 		if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
+		{
+			seg->mdfd_dirtied_cycle = cycle;
 			return;				/* passed it off successfully */
+		}
 
 		ereport(DEBUG1,
 				(errmsg("could not forward fsync request because request queue is full")));
@@ -1623,7 +1644,7 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		/* if new entry, initialize it */
 		if (!found)
 		{
-			entry->cycle_ctr = mdsync_cycle_ctr;
+			entry->cycle_ctr = GetCheckpointSyncCycle();
 			MemSet(entry->requests, 0, sizeof(entry->requests));
 			MemSet(entry->canceled, 0, sizeof(entry->canceled));
 		}
@@ -1793,6 +1814,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	v = &reln->md_seg_fds[forknum][segno];
 	v->mdfd_vfd = fd;
 	v->mdfd_segno = segno;
+	v->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
 
 	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d1..87a5cfad415 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -38,6 +38,9 @@ extern void AbsorbFsyncRequests(void);
 extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
 
+extern uint32 GetCheckpointSyncCycle(void);
+extern uint32 IncCheckpointSyncCycle(void);
+
 extern bool FirstCallSinceLastCheckpoint(void);
 
 #endif							/* _BGWRITER_H */
-- 
2.17.0.rc1.dirty

v1-0007-Heavily-WIP-Send-file-descriptors-to-checkpointer.patchtext/x-diff; charset=us-asciiDownload
From 1c99e98386a763de70326e96fb9b7cfa72373e5f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 May 2018 13:05:42 -0700
Subject: [PATCH v1 7/7] Heavily-WIP: Send file descriptors to checkpointer for
 fsyncing.

This addresses the issue that, at least on linux, fsync only reliably
sees errors that occurred after the file has been opened.

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/transam/xlog.c     |   7 +-
 src/backend/postmaster/checkpointer.c | 358 +++++++----------
 src/backend/postmaster/postmaster.c   |  38 ++
 src/backend/storage/smgr/md.c         | 545 ++++++++++++++++----------
 src/include/postmaster/bgwriter.h     |   8 +-
 src/include/postmaster/postmaster.h   |   5 +
 src/include/storage/smgr.h            |   3 +-
 7 files changed, 542 insertions(+), 422 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adbd6a21264..427774152eb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8634,8 +8634,10 @@ CreateCheckPoint(int flags)
 	 * Note: because it is possible for log_checkpoints to change while a
 	 * checkpoint proceeds, we always accumulate stats, even if
 	 * log_checkpoints is currently off.
+	 *
+	 * Note #2: this is reset at the end of the checkpoint, not here, because
+	 * we might have to fsync before getting here (see mdsync()).
 	 */
-	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
 	/*
@@ -8999,6 +9001,9 @@ CreateCheckPoint(int flags)
 									 CheckpointStats.ckpt_segs_recycled);
 
 	LWLockRelease(CheckpointLock);
+
+	/* reset stats */
+	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 }
 
 /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 333eb91c9de..1bce610336a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -48,6 +48,7 @@
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/postmaster.h"
 #include "replication/syncrep.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
@@ -102,19 +103,21 @@
  *
  * The requests array holds fsync requests sent by backends and not yet
  * absorbed by the checkpointer.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
  *----------
  */
 typedef struct
 {
+	uint32		type;
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber segno;			/* see md.c for special values */
+	bool		contains_fd;
 	/* might add a real request-type field later; not needed yet */
 } CheckpointerRequest;
 
+#define CKPT_REQUEST_RNODE			1
+#define CKPT_REQUEST_SYN			2
+
 typedef struct
 {
 	pid_t		checkpointer_pid;	/* PID (0 if not started) */
@@ -131,8 +134,6 @@ typedef struct
 	pg_atomic_uint32 num_backend_fsync;	/* counts user backend fsync calls */
 	pg_atomic_uint32 ckpt_cycle; /* cycle */
 
-	int			num_requests;	/* current # of requests */
-	int			max_requests;	/* allocated array size */
 	CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
 } CheckpointerShmemStruct;
 
@@ -168,13 +169,17 @@ static double ckpt_cached_elapsed;
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
 
+static BlockNumber next_syn_rqst;
+static BlockNumber received_syn_rqst;
+
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
 static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
-static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void SendFsyncRequest(CheckpointerRequest *request, int fd);
+static bool AbsorbFsyncRequest(void);
 
 /* Signal handlers */
 
@@ -557,10 +562,11 @@ CheckpointerMain(void)
 			cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
 		}
 
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-					   cur_timeout * 1000L /* convert to ms */ ,
-					   WAIT_EVENT_CHECKPOINTER_MAIN);
+		rc = WaitLatchOrSocket(MyLatch,
+							   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
+							   fsync_fds[FSYNC_FD_PROCESS],
+							   cur_timeout * 1000L /* convert to ms */ ,
+							   WAIT_EVENT_CHECKPOINTER_MAIN);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
@@ -910,12 +916,7 @@ CheckpointerShmemSize(void)
 {
 	Size		size;
 
-	/*
-	 * Currently, the size of the requests[] array is arbitrarily set equal to
-	 * NBuffers.  This may prove too large or small ...
-	 */
 	size = offsetof(CheckpointerShmemStruct, requests);
-	size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
 
 	return size;
 }
@@ -938,13 +939,10 @@ CheckpointerShmemInit(void)
 	if (!found)
 	{
 		/*
-		 * First time through, so initialize.  Note that we zero the whole
-		 * requests array; this is so that CompactCheckpointerRequestQueue can
-		 * assume that any pad bytes in the request structs are zeroes.
+		 * First time through, so initialize.
 		 */
 		MemSet(CheckpointerShmem, 0, size);
 		SpinLockInit(&CheckpointerShmem->ckpt_lck);
-		CheckpointerShmem->max_requests = NBuffers;
 		pg_atomic_init_u32(&CheckpointerShmem->ckpt_cycle, 0);
 		pg_atomic_init_u32(&CheckpointerShmem->num_backend_writes, 0);
 		pg_atomic_init_u32(&CheckpointerShmem->num_backend_fsync, 0);
@@ -1124,176 +1122,61 @@ RequestCheckpoint(int flags)
  * the queue is full and contains no duplicate entries.  In that case, we
  * let the backend know by returning false.
  */
-bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+void
+ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno,
+					File file)
 {
-	CheckpointerRequest *request;
-	bool		too_full;
+	CheckpointerRequest request = {0};
 
 	if (!IsUnderPostmaster)
-		return false;			/* probably shouldn't even get here */
+		elog(ERROR, "ForwardFsyncRequest must not be called in single user mode");
 
 	if (AmCheckpointerProcess())
 		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
 
-	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+	request.type = CKPT_REQUEST_RNODE;
+	request.rnode = rnode;
+	request.forknum = forknum;
+	request.segno = segno;
+	request.contains_fd = file != -1;
 
-	/*
-	 * If the checkpointer isn't running or the request queue is full, the
-	 * backend will have to perform its own fsync request.  But before forcing
-	 * that to happen, we can try to compact the request queue.
-	 */
-	if (CheckpointerShmem->checkpointer_pid == 0 ||
-		(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
-		 !CompactCheckpointerRequestQueue()))
-	{
-		/*
-		 * Count the subset of writes where backends have to do their own
-		 * fsync
-		 */
-		if (!AmBackgroundWriterProcess())
-			pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_fsync, 1);
-		LWLockRelease(CheckpointerCommLock);
-		return false;
-	}
-
-	/* OK, insert request */
-	request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
-	request->rnode = rnode;
-	request->forknum = forknum;
-	request->segno = segno;
-
-	/* If queue is more than half full, nudge the checkpointer to empty it */
-	too_full = (CheckpointerShmem->num_requests >=
-				CheckpointerShmem->max_requests / 2);
-
-	LWLockRelease(CheckpointerCommLock);
-
-	/* ... but not till after we release the lock */
-	if (too_full && ProcGlobal->checkpointerLatch)
-		SetLatch(ProcGlobal->checkpointerLatch);
-
-	return true;
-}
-
-/*
- * CompactCheckpointerRequestQueue
- *		Remove duplicates from the request queue to avoid backend fsyncs.
- *		Returns "true" if any entries were removed.
- *
- * Although a full fsync request queue is not common, it can lead to severe
- * performance problems when it does happen.  So far, this situation has
- * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint.  Without this
- * logic, each backend begins doing an fsync for every block written, which
- * gets very expensive and can slow down the whole system.
- *
- * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries.  But that should be vanishingly rare in
- * practice: there's one queue entry per shared buffer.
- */
-static bool
-CompactCheckpointerRequestQueue(void)
-{
-	struct CheckpointerSlotMapping
-	{
-		CheckpointerRequest request;
-		int			slot;
-	};
-
-	int			n,
-				preserve_count;
-	int			num_skipped = 0;
-	HASHCTL		ctl;
-	HTAB	   *htab;
-	bool	   *skip_slot;
-
-	/* must hold CheckpointerCommLock in exclusive mode */
-	Assert(LWLockHeldByMe(CheckpointerCommLock));
-
-	/* Initialize skip_slot array */
-	skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
-
-	/* Initialize temporary hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
-	ctl.keysize = sizeof(CheckpointerRequest);
-	ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
-	ctl.hcxt = CurrentMemoryContext;
-
-	htab = hash_create("CompactCheckpointerRequestQueue",
-					   CheckpointerShmem->num_requests,
-					   &ctl,
-					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-	/*
-	 * The basic idea here is that a request can be skipped if it's followed
-	 * by a later, identical request.  It might seem more sensible to work
-	 * backwards from the end of the queue and check whether a request is
-	 * *preceded* by an earlier, identical request, in the hopes of doing less
-	 * copying.  But that might change the semantics, if there's an
-	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
-	 * we do it this way.  It would be possible to be even smarter if we made
-	 * the code below understand the specific semantics of such requests (it
-	 * could blow away preceding entries that would end up being canceled
-	 * anyhow), but it's not clear that the extra complexity would buy us
-	 * anything.
-	 */
-	for (n = 0; n < CheckpointerShmem->num_requests; n++)
-	{
-		CheckpointerRequest *request;
-		struct CheckpointerSlotMapping *slotmap;
-		bool		found;
-
-		/*
-		 * We use the request struct directly as a hashtable key.  This
-		 * assumes that any padding bytes in the structs are consistently the
-		 * same, which should be okay because we zeroed them in
-		 * CheckpointerShmemInit.  Note also that RelFileNode had better
-		 * contain no pad bytes.
-		 */
-		request = &CheckpointerShmem->requests[n];
-		slotmap = hash_search(htab, request, HASH_ENTER, &found);
-		if (found)
-		{
-			/* Duplicate, so mark the previous occurrence as skippable */
-			skip_slot[slotmap->slot] = true;
-			num_skipped++;
-		}
-		/* Remember slot containing latest occurrence of this request value */
-		slotmap->slot = n;
-	}
-
-	/* Done with the hash table. */
-	hash_destroy(htab);
-
-	/* If no duplicates, we're out of luck. */
-	if (!num_skipped)
-	{
-		pfree(skip_slot);
-		return false;
-	}
-
-	/* We found some duplicates; remove them. */
-	preserve_count = 0;
-	for (n = 0; n < CheckpointerShmem->num_requests; n++)
-	{
-		if (skip_slot[n])
-			continue;
-		CheckpointerShmem->requests[preserve_count++] = CheckpointerShmem->requests[n];
-	}
-	ereport(DEBUG1,
-			(errmsg("compacted fsync request queue from %d entries to %d entries",
-					CheckpointerShmem->num_requests, preserve_count)));
-	CheckpointerShmem->num_requests = preserve_count;
-
-	/* Cleanup. */
-	pfree(skip_slot);
-	return true;
+	SendFsyncRequest(&request, request.contains_fd ? FileGetRawDesc(file) : -1);
 }
 
 /*
  * AbsorbFsyncRequests
- *		Retrieve queued fsync requests and pass them to local smgr.
+ *		Retrieve queued fsync requests and pass them to local smgr. Stop when
+ *		resources would be exhausted by absorbing more.
+ *
+ * This is exported because we want to continue accepting requests during
+ * mdsync().
+ */
+void
+AbsorbFsyncRequests(void)
+{
+	if (!AmCheckpointerProcess())
+		return;
+
+	/* Transfer stats counts into pending pgstats message */
+	BgWriterStats.m_buf_written_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+	BgWriterStats.m_buf_fsync_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
+
+	while (true)
+	{
+		if (!FlushFsyncRequestQueueIfNecessary())
+			break;
+
+		if (!AbsorbFsyncRequest())
+			break;
+	}
+}
+
+/*
+ * AbsorbAllFsyncRequests
+ *		Retrieve all already pending fsync requests and pass them to local
+ *		smgr.
  *
  * This is exported because it must be called during CreateCheckPoint;
  * we have to be sure we have accepted all pending requests just before
@@ -1301,17 +1184,13 @@ CompactCheckpointerRequestQueue(void)
  * non-checkpointer processes, do nothing if not checkpointer.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbAllFsyncRequests(void)
 {
-	CheckpointerRequest *requests = NULL;
-	CheckpointerRequest *request;
-	int			n;
+	CheckpointerRequest request = {0};
 
 	if (!AmCheckpointerProcess())
 		return;
 
-	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
 	/* Transfer stats counts into pending pgstats message */
 	BgWriterStats.m_buf_written_backend +=
 		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
@@ -1319,35 +1198,61 @@ AbsorbFsyncRequests(void)
 		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
 
 	/*
-	 * We try to avoid holding the lock for a long time by copying the request
-	 * array, and processing the requests after releasing the lock.
-	 *
-	 * Once we have cleared the requests from shared memory, we have to PANIC
-	 * if we then fail to absorb them (eg, because our hashtable runs out of
-	 * memory).  This is because the system cannot run safely if we are unable
-	 * to fsync what we have been told to fsync.  Fortunately, the hashtable
-	 * is so small that the problem is quite unlikely to arise in practice.
+	 * For mdsync()'s guarantees to work, all pending fsync requests need to
+	 * be executed. But we don't want to absorb requests till the queue is
+	 * empty, as that could take a long while.  So instead we enqueue a sync marker and absorb only until it comes back.
 	 */
-	n = CheckpointerShmem->num_requests;
-	if (n > 0)
+	request.type = CKPT_REQUEST_SYN;
+	request.segno = ++next_syn_rqst;
+	SendFsyncRequest(&request, -1);
+
+	received_syn_rqst = next_syn_rqst + 1;
+	while (received_syn_rqst != request.segno)
 	{
-		requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
-		memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
+		if (!FlushFsyncRequestQueueIfNecessary())
+			elog(FATAL, "cannot flush fsync request queue");
+
+		if (!AbsorbFsyncRequest())
+			break;
+	}
+}
+
+/*
+ * AbsorbFsyncRequest
+ *		Retrieve one queued fsync request and pass it to the local smgr.
+ */
+static bool
+AbsorbFsyncRequest(void)
+{
+	CheckpointerRequest req;
+	int fd;
+	int ret;
+
+	/* FIXME, this should be a critical section */
+	ReserveTransientFile();
+
+	ret = pg_uds_recv_with_fd(fsync_fds[FSYNC_FD_PROCESS], &req, sizeof(req), &fd);
+	if (ret < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
+		return false;
+	else if (ret < 0)
+		elog(FATAL, "recvmsg failed: %m");
+
+	if (req.contains_fd != (fd != -1))
+	{
+		elog(FATAL, "fd presence in message does not match request");
+	}
 
-	START_CRIT_SECTION();
+	if (req.type == CKPT_REQUEST_SYN)
+	{
+		received_syn_rqst = req.segno;
+		Assert(fd == -1);
+	}
+	else
+	{
+		RememberFsyncRequest(req.rnode, req.forknum, req.segno, fd);
+	}
 
-	CheckpointerShmem->num_requests = 0;
-
-	LWLockRelease(CheckpointerCommLock);
-
-	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
-
-	END_CRIT_SECTION();
-
-	if (requests)
-		pfree(requests);
+	return true;
 }
 
 /*
@@ -1402,3 +1307,42 @@ IncCheckpointSyncCycle(void)
 {
 	return pg_atomic_fetch_add_u32(&CheckpointerShmem->ckpt_cycle, 1);
 }
+
+void
+CountBackendWrite(void)
+{
+	pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_writes, 1);
+}
+
+static void
+SendFsyncRequest(CheckpointerRequest *request, int fd)
+{
+	ssize_t ret;
+
+	while (true)
+	{
+		ret = pg_uds_send_with_fd(fsync_fds[FSYNC_FD_SUBMIT], request, sizeof(*request),
+								  request->contains_fd ? fd : -1);
+
+		if (ret >= 0)
+		{
+			/*
+			 * Don't think short writes will ever happen in realistic
+			 * implementations, but better make sure that's true...
+			 */
+			if (ret != sizeof(*request))
+				elog(FATAL, "unexpected short write of fsync request");
+			break;
+		}
+		else if (errno == EWOULDBLOCK || errno == EAGAIN)
+		{
+			/* blocked on write - wait for socket to become readable */
+			/* FIXME: postmaster death? Other interrupts? */
+			WaitLatchOrSocket(NULL, WL_SOCKET_WRITEABLE, fsync_fds[FSYNC_FD_SUBMIT], -1, 0);
+		}
+		else
+		{
+			ereport(FATAL, (errmsg("could not send fsync request: %m")));
+		}
+	}
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b53b33cdd..135aa29bfeb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -70,6 +70,7 @@
 #include <time.h>
 #include <sys/wait.h>
 #include <ctype.h>
+#include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/socket.h>
 #include <fcntl.h>
@@ -434,6 +435,7 @@ static pid_t StartChildProcess(AuxProcType type);
 static void StartAutovacuumWorker(void);
 static void MaybeStartWalReceiver(void);
 static void InitPostmasterDeathWatchHandle(void);
+static void InitFsyncFdSocketPair(void);
 
 /*
  * Archiver is allowed to start up at the current postmaster state?
@@ -568,6 +570,8 @@ int			postmaster_alive_fds[2] = {-1, -1};
 HANDLE		PostmasterHandle;
 #endif
 
+int			fsync_fds[2] = {-1, -1};
+
 /*
  * Postmaster main entry point
  */
@@ -1195,6 +1199,11 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	InitPostmasterDeathWatchHandle();
 
+	/*
+	 * Initialize socket pair used to transport file descriptors over.
+	 */
+	InitFsyncFdSocketPair();
+
 #ifdef WIN32
 
 	/*
@@ -6443,3 +6452,32 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+/* Create socket used for requesting fsyncs by checkpointer */
+static void
+InitFsyncFdSocketPair(void)
+{
+	Assert(MyProcPid == PostmasterPid);
+	if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0)
+		ereport(FATAL,
+				(errcode_for_file_access(),
+				 errmsg_internal("could not create fsync sockets: %m")));
+
+	/*
+	 * Set O_NONBLOCK on both fds.
+	 */
+	if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFL, O_NONBLOCK) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set fsync process socket to nonblocking mode: %m")));
+
+	if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFL, O_NONBLOCK) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set fsync submit socket to nonblocking mode: %m")));
+
+	/*
+	 * FIXME: do DuplicateHandle dance for windows - can that work
+	 * trivially?
+	 */
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 555774320b5..e24b0e9ec39 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -142,8 +142,8 @@ typedef struct
 	CycleCtr	cycle_ctr;		/* sync cycle of oldest request */
 	/* requests[f] has bit n set if we need to fsync segment n of fork f */
 	Bitmapset  *requests[MAX_FORKNUM + 1];
-	/* canceled[f] is true if we canceled fsyncs for fork "recently" */
-	bool		canceled[MAX_FORKNUM + 1];
+	int		   *syncfds[MAX_FORKNUM + 1];
+	int			syncfd_len[MAX_FORKNUM + 1];
 } PendingOperationEntry;
 
 typedef struct
@@ -152,6 +152,8 @@ typedef struct
 	CycleCtr	cycle_ctr;		/* mdckpt_cycle_ctr when request was made */
 } PendingUnlinkEntry;
 
+static uint32 open_fsync_queue_files = 0;
+static bool mdsync_in_progress = false;
 static HTAB *pendingOpsTable = NULL;
 static List *pendingUnlinks = NIL;
 static MemoryContext pendingOpsCxt; /* context for the above  */
@@ -196,6 +198,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
 			 BlockNumber blkno, bool skipFsync, int behavior);
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 		   MdfdVec *seg);
+static char *mdpath(RelFileNode rnode, ForkNumber forknum, BlockNumber segno);
+static void mdsyncpass(bool include_current);
 
 
 /*
@@ -1049,43 +1053,28 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- *	mdsync() -- Sync previous writes to stable storage.
+ * Do one pass over the fsync request hashtable and perform the necessary
+ * fsyncs. Increments the mdsync cycle counter.
+ *
+ * If include_current is true, perform all fsyncs (this is done if too many
+ * files are open); otherwise only perform the fsyncs belonging to the cycle
+ * valid at call time.
  */
-void
-mdsync(void)
+static void
+mdsyncpass(bool include_current)
 {
-	static bool mdsync_in_progress = false;
-
 	HASH_SEQ_STATUS hstat;
 	PendingOperationEntry *entry;
 	int			absorb_counter;
 
 	/* Statistics on sync times */
-	int			processed = 0;
 	instr_time	sync_start,
 				sync_end,
 				sync_diff;
 	uint64		elapsed;
-	uint64		longest = 0;
-	uint64		total_elapsed = 0;
-
-	/*
-	 * This is only called during checkpoints, and checkpoints should only
-	 * occur in processes that have created a pendingOpsTable.
-	 */
-	if (!pendingOpsTable)
-		elog(ERROR, "cannot sync without a pendingOpsTable");
-
-	/*
-	 * If we are in the checkpointer, the sync had better include all fsync
-	 * requests that were queued by backends up to this point.  The tightest
-	 * race condition that could occur is that a buffer that must be written
-	 * and fsync'd for the checkpoint could have been dumped by a backend just
-	 * before it was visited by BufferSync().  We know the backend will have
-	 * queued an fsync request before clearing the buffer's dirtybit, so we
-	 * are safe as long as we do an Absorb after completing BufferSync().
-	 */
-	AbsorbFsyncRequests();
+	int			processed = CheckpointStats.ckpt_sync_rels;
+	uint64		longest = CheckpointStats.ckpt_longest_sync;
+	uint64		total_elapsed = CheckpointStats.ckpt_agg_sync_time;
 
 	/*
 	 * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
@@ -1133,17 +1122,27 @@ mdsync(void)
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber	forknum;
+		bool has_remaining;
 
 		/*
-		 * If the entry is new then don't process it this time; it might
-		 * contain multiple fsync-request bits, but they are all new.  Note
-		 * "continue" bypasses the hash-remove call at the bottom of the loop.
+		 * If we are processing fsync requests because too many file handles
+		 * are open, close files regardless of cycle.  Otherwise we might
+		 * find nothing to close, and we want to make room as quickly as
+		 * possible so more requests can be absorbed.
 		 */
-		if (entry->cycle_ctr == GetCheckpointSyncCycle())
-			continue;
+		if (!include_current)
+		{
+			/*
+			 * If the entry is new then don't process it this time; it might
+			 * contain multiple fsync-request bits, but they are all new.  Note
+			 * "continue" bypasses the hash-remove call at the bottom of the loop.
+			 */
+			if (entry->cycle_ctr == GetCheckpointSyncCycle())
+				continue;
 
-		/* Else assert we haven't missed it */
-		Assert((CycleCtr) (entry->cycle_ctr + 1) == GetCheckpointSyncCycle());
+			/* Else assert we haven't missed it */
+			Assert((CycleCtr) (entry->cycle_ctr + 1) == GetCheckpointSyncCycle());
+		}
 
 		/*
 		 * Scan over the forks and segments represented by the entry.
@@ -1158,158 +1157,151 @@ mdsync(void)
 		 */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
-			Bitmapset  *requests = entry->requests[forknum];
 			int			segno;
 
-			entry->requests[forknum] = NULL;
-			entry->canceled[forknum] = false;
-
-			while ((segno = bms_first_member(requests)) >= 0)
+			segno = -1;
+			while ((segno = bms_next_member(entry->requests[forknum], segno)) >= 0)
 			{
-				int			failures;
+				char	   *path;
+				int			returnCode;
+
+				/*
+				 * Temporarily mark as processed. Have to do so before
+				 * absorbing further requests, otherwise we might delete a new
+				 * requests in a new cycle.
+				 */
+				bms_del_member(entry->requests[forknum], segno);
+
+				if (entry->syncfd_len[forknum] <= segno ||
+					entry->syncfds[forknum][segno] == -1)
+				{
+					/*
+					 * This is where we could open the file ourselves, if we
+					 * ever want to support not transporting fds as well.
+					 */
+					elog(FATAL, "file not opened");
+				}
 
 				/*
 				 * If fsync is off then we don't have to bother opening the
 				 * file at all.  (We delay checking until this point so that
 				 * changing fsync on the fly behaves sensibly.)
+				 *
+				 * XXX: Why is that an important goal? Doesn't give any
+				 * interesting guarantees afaict?
 				 */
-				if (!enableFsync)
-					continue;
-
-				/*
-				 * If in checkpointer, we want to absorb pending requests
-				 * every so often to prevent overflow of the fsync request
-				 * queue.  It is unspecified whether newly-added entries will
-				 * be visited by hash_seq_search, but we don't care since we
-				 * don't need to process them anyway.
-				 */
-				if (--absorb_counter <= 0)
+				if (enableFsync)
 				{
-					AbsorbFsyncRequests();
-					absorb_counter = FSYNCS_PER_ABSORB;
-				}
-
-				/*
-				 * The fsync table could contain requests to fsync segments
-				 * that have been deleted (unlinked) by the time we get to
-				 * them. Rather than just hoping an ENOENT (or EACCES on
-				 * Windows) error can be ignored, what we do on error is
-				 * absorb pending requests and then retry.  Since mdunlink()
-				 * queues a "cancel" message before actually unlinking, the
-				 * fsync request is guaranteed to be marked canceled after the
-				 * absorb if it really was this case. DROP DATABASE likewise
-				 * has to tell us to forget fsync requests before it starts
-				 * deletions.
-				 */
-				for (failures = 0;; failures++) /* loop exits at "break" */
-				{
-					SMgrRelation reln;
-					MdfdVec    *seg;
-					char	   *path;
-					int			save_errno;
-
 					/*
-					 * Find or create an smgr hash entry for this relation.
-					 * This may seem a bit unclean -- md calling smgr?	But
-					 * it's really the best solution.  It ensures that the
-					 * open file reference isn't permanently leaked if we get
-					 * an error here. (You may say "but an unreferenced
-					 * SMgrRelation is still a leak!" Not really, because the
-					 * only case in which a checkpoint is done by a process
-					 * that isn't about to shut down is in the checkpointer,
-					 * and it will periodically do smgrcloseall(). This fact
-					 * justifies our not closing the reln in the success path
-					 * either, which is a good thing since in non-checkpointer
-					 * cases we couldn't safely do that.)
+					 * The fsync table could contain requests to fsync
+					 * segments that have been deleted (unlinked) by the time
+					 * we get to them.  That used to be problematic, but now
+					 * we have a filehandle to the deleted file. That means we
+					 * might fsync an empty file superfluously, in a
+					 * relatively tight window, which is acceptable.
 					 */
-					reln = smgropen(entry->rnode, InvalidBackendId);
 
-					/* Attempt to open and fsync the target segment */
-					seg = _mdfd_getseg(reln, forknum,
-									   (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
-									   false,
-									   EXTENSION_RETURN_NULL
-									   | EXTENSION_DONT_CHECK_SIZE);
+					path = mdpath(entry->rnode, forknum, segno);
 
 					INSTR_TIME_SET_CURRENT(sync_start);
 
-					if (seg != NULL &&
-						FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
+					pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_SYNC);
+					returnCode = pg_fsync(entry->syncfds[forknum][segno]);
+					pgstat_report_wait_end();
+
+					if (returnCode < 0)
 					{
-						/* Success; update statistics about sync timing */
-						INSTR_TIME_SET_CURRENT(sync_end);
-						sync_diff = sync_end;
-						INSTR_TIME_SUBTRACT(sync_diff, sync_start);
-						elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
-						if (elapsed > longest)
-							longest = elapsed;
-						total_elapsed += elapsed;
-						processed++;
-						if (log_checkpoints)
-							elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
-								 processed,
-								 FilePathName(seg->mdfd_vfd),
-								 (double) elapsed / 1000);
+						/* XXX: decide on policy */
+						bms_add_member(entry->requests[forknum], segno);
 
-						break;	/* out of retry loop */
-					}
-
-					/* Compute file name for use in message */
-					save_errno = errno;
-					path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
-					errno = save_errno;
-
-					/*
-					 * It is possible that the relation has been dropped or
-					 * truncated since the fsync request was entered.
-					 * Therefore, allow ENOENT, but only if we didn't fail
-					 * already on this file.  This applies both for
-					 * _mdfd_getseg() and for FileSync, since fd.c might have
-					 * closed the file behind our back.
-					 *
-					 * XXX is there any point in allowing more than one retry?
-					 * Don't see one at the moment, but easy to change the
-					 * test here if so.
-					 */
-					if (!FILE_POSSIBLY_DELETED(errno) ||
-						failures > 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\": %m",
 										path)));
-					else
+					}
+
+					/* Success; update statistics about sync timing */
+					INSTR_TIME_SET_CURRENT(sync_end);
+					sync_diff = sync_end;
+					INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+					elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+					if (elapsed > longest)
+						longest = elapsed;
+					total_elapsed += elapsed;
+					processed++;
+					if (log_checkpoints)
 						ereport(DEBUG1,
-								(errcode_for_file_access(),
-								 errmsg("could not fsync file \"%s\" but retrying: %m",
-										path)));
+								(errmsg("checkpoint sync: number=%d file=%s time=%.3f msec",
+										processed,
+										path,
+										(double) elapsed / 1000),
+								 errhidestmt(true),
+								 errhidecontext(true)));
+
 					pfree(path);
+				}
 
+				/*
+				 * It shouldn't be possible for a new request to arrive during
+				 * the fsync (on error this will not be reached).
+				 */
+				Assert(!bms_is_member(segno, entry->requests[forknum]));
+
+				/*
+				 * Close file.  XXX: centralize code.
+				 */
+				{
+					open_fsync_queue_files--;
+					CloseTransientFile(entry->syncfds[forknum][segno]);
+					entry->syncfds[forknum][segno] = -1;
+				}
+
+				/*
+				 * If in checkpointer, we want to absorb pending requests every so
+				 * often to prevent overflow of the fsync request queue.  It is
+				 * unspecified whether newly-added entries will be visited by
+				 * hash_seq_search, but we don't care since we don't need to process
+				 * them anyway.
+				 */
+				if (absorb_counter-- <= 0)
+				{
 					/*
-					 * Absorb incoming requests and check to see if a cancel
-					 * arrived for this relation fork.
+					 * Don't absorb if too many files are open. This pass will
+					 * soon close some, so check again later.
 					 */
-					AbsorbFsyncRequests();
-					absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
-					if (entry->canceled[forknum])
-						break;
-				}				/* end retry loop */
+					if (open_fsync_queue_files < ((MaxTransientFiles() * 8) / 10))
+						AbsorbFsyncRequests();
+					absorb_counter = FSYNCS_PER_ABSORB;
+				}
 			}
-			bms_free(requests);
 		}
 
 		/*
-		 * We've finished everything that was requested before we started to
-		 * scan the entry.  If no new requests have been inserted meanwhile,
-		 * remove the entry.  Otherwise, update its cycle counter, as all the
-		 * requests now in it must have arrived during this cycle.
+		 * We've finished everything for the file that was requested before we
+		 * started to scan the entry.  If no new requests have been inserted
+		 * meanwhile, remove the entry.  Otherwise, update its cycle counter,
+		 * as all the requests now in it must have arrived during this cycle.
+		 *
+		 * This needs to be checked separately from the above for-each-fork
+		 * loop, as new requests for this relation could have been absorbed.
 		 */
+		has_remaining = false;
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
-			if (entry->requests[forknum] != NULL)
-				break;
+			if (bms_is_empty(entry->requests[forknum]))
+			{
+				if (entry->syncfds[forknum])
+				{
+					pfree(entry->syncfds[forknum]);
+					entry->syncfds[forknum] = NULL;
+				}
+				bms_free(entry->requests[forknum]);
+				entry->requests[forknum] = NULL;
+			}
+			else
+				has_remaining = true;
 		}
-		if (forknum <= MAX_FORKNUM)
+		if (has_remaining)
 			entry->cycle_ctr = GetCheckpointSyncCycle();
 		else
 		{
@@ -1320,13 +1312,66 @@ mdsync(void)
 		}
 	}							/* end loop over hashtable entries */
 
-	/* Return sync performance metrics for report at checkpoint end */
+	/* Flag successful completion of mdsync */
+	mdsync_in_progress = false;
+
+	/* Maintain sync performance metrics for report at checkpoint end */
 	CheckpointStats.ckpt_sync_rels = processed;
 	CheckpointStats.ckpt_longest_sync = longest;
 	CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+}
 
-	/* Flag successful completion of mdsync */
-	mdsync_in_progress = false;
+/*
+ *	mdsync() -- Sync previous writes to stable storage.
+ */
+void
+mdsync(void)
+{
+	/*
+	 * This is only called during checkpoints, and checkpoints should only
+	 * occur in processes that have created a pendingOpsTable.
+	 */
+	if (!pendingOpsTable)
+		elog(ERROR, "cannot sync without a pendingOpsTable");
+
+	/*
+	 * If we are in the checkpointer, the sync had better include all fsync
+	 * requests that were queued by backends up to this point.  The tightest
+	 * race condition that could occur is that a buffer that must be written
+	 * and fsync'd for the checkpoint could have been dumped by a backend just
+	 * before it was visited by BufferSync().  We know the backend will have
+	 * queued an fsync request before clearing the buffer's dirtybit, so we
+	 * are safe as long as we do an Absorb after completing BufferSync().
+	 */
+	AbsorbAllFsyncRequests();
+
+	mdsyncpass(false);
+}
+
+/*
+ * Flush the fsync request queue enough to make sure there's room for at least
+ * one more entry.
+ */
+bool
+FlushFsyncRequestQueueIfNecessary(void)
+{
+	if (mdsync_in_progress)
+		return false;
+
+	while (true)
+	{
+		if (open_fsync_queue_files >= ((MaxTransientFiles() * 8) / 10))
+		{
+			elog(DEBUG1,
+				 "flush fsync request queue due to %u open files",
+				 open_fsync_queue_files);
+			mdsyncpass(true);
+		}
+		else
+			break;
+	}
+
+	return true;
 }
 
 /*
@@ -1411,12 +1456,38 @@ mdpostckpt(void)
 		 */
 		if (--absorb_counter <= 0)
 		{
-			AbsorbFsyncRequests();
+			/* XXX: Centralize this condition */
+			if (open_fsync_queue_files < ((MaxTransientFiles() * 8) / 10))
+				AbsorbFsyncRequests();
 			absorb_counter = UNLINKS_PER_ABSORB;
 		}
 	}
 }
 
+
+/*
+ * Return the filename for the specified segment of the relation. The
+ * returned string is palloc'd.
+ */
+static char *
+mdpath(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+{
+	char	   *path,
+			   *fullpath;
+
+	path = relpathperm(rnode, forknum);
+
+	if (segno > 0)
+	{
+		fullpath = psprintf("%s.%u", path, segno);
+		pfree(path);
+	}
+	else
+		fullpath = path;
+
+	return fullpath;
+}
+
 /*
  * register_dirty_segment() -- Mark a relation segment as needing fsync
  *
@@ -1437,6 +1508,13 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 	pg_memory_barrier();
 	cycle = GetCheckpointSyncCycle();
 
+	/*
+	 * For historical reasons the checkpointer keeps track of the number of
+	 * times backends perform writes themselves.
+	 */
+	if (!AmBackgroundWriterProcess())
+		CountBackendWrite();
+
 	/*
 	 * Don't repeatedly register the same segment as dirty.
 	 *
@@ -1449,27 +1527,23 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (pendingOpsTable)
 	{
-		/* push it into local pending-ops table */
-		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
-		seg->mdfd_dirtied_cycle = cycle;
+		int fd;
+
+		/*
+		 * Push it into local pending-ops table.
+		 *
+		 * Gotta duplicate the fd - we can't have fd.c close it behind our
+		 * back, as that'd lead to loosing error reporting guarantees on
+		 * linux. RememberFsyncRequest() will manage the lifetime.
+		 */
+		ReserveTransientFile();
+		fd = dup(FileGetRawDesc(seg->mdfd_vfd));
+		if (fd < 0)
+			elog(ERROR, "couldn't dup: %m");
+		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno, fd);
 	}
 	else
-	{
-		if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
-		{
-			seg->mdfd_dirtied_cycle = cycle;
-			return;				/* passed it off successfully */
-		}
-
-		ereport(DEBUG1,
-				(errmsg("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
-	}
+		ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno, seg->mdfd_vfd);
 }
 
 /*
@@ -1491,21 +1565,14 @@ register_unlink(RelFileNodeBackend rnode)
 	{
 		/* push it into local pending-ops table */
 		RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
-							 UNLINK_RELATION_REQUEST);
+							 UNLINK_RELATION_REQUEST,
+							 -1);
 	}
 	else
 	{
-		/*
-		 * Notify the checkpointer about it.  If we fail to queue the request
-		 * message, we have to sleep and try again, because we can't simply
-		 * delete the file now.  Ugly, but hopefully won't happen often.
-		 *
-		 * XXX should we just leave the file orphaned instead?
-		 */
+		/* Notify the checkpointer about it. */
 		Assert(IsUnderPostmaster);
-		while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
-									UNLINK_RELATION_REQUEST))
-			pg_usleep(10000L);	/* 10 msec seems a good number */
+		ForwardFsyncRequest(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST, -1);
 	}
 }
 
@@ -1531,7 +1598,7 @@ register_unlink(RelFileNodeBackend rnode)
  * heavyweight operation anyhow, so we'll live with it.)
  */
 void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno, int fd)
 {
 	Assert(pendingOpsTable);
 
@@ -1549,18 +1616,28 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 			/*
 			 * We can't just delete the entry since mdsync could have an
 			 * active hashtable scan.  Instead we delete the bitmapsets; this
-			 * is safe because of the way mdsync is coded.  We also set the
-			 * "canceled" flags so that mdsync can tell that a cancel arrived
-			 * for the fork(s).
+			 * is safe because of the way mdsync is coded.
 			 */
 			if (forknum == InvalidForkNumber)
 			{
 				/* remove requests for all forks */
 				for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 				{
+					int segno;
+
 					bms_free(entry->requests[forknum]);
 					entry->requests[forknum] = NULL;
-					entry->canceled[forknum] = true;
+
+					for (segno = 0; segno < entry->syncfd_len[forknum]; segno++)
+					{
+						if (entry->syncfds[forknum][segno] != -1)
+						{
+							open_fsync_queue_files--;
+							CloseTransientFile(entry->syncfds[forknum][segno]);
+							entry->syncfds[forknum][segno] = -1;
+						}
+					}
+
 				}
 			}
 			else
@@ -1568,7 +1645,16 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 				/* remove requests for single fork */
 				bms_free(entry->requests[forknum]);
 				entry->requests[forknum] = NULL;
-				entry->canceled[forknum] = true;
+
+				for (segno = 0; segno < entry->syncfd_len[forknum]; segno++)
+				{
+					if (entry->syncfds[forknum][segno] != -1)
+					{
+						open_fsync_queue_files--;
+						CloseTransientFile(entry->syncfds[forknum][segno]);
+						entry->syncfds[forknum][segno] = -1;
+					}
+				}
 			}
 		}
 	}
@@ -1592,7 +1678,6 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 				{
 					bms_free(entry->requests[forknum]);
 					entry->requests[forknum] = NULL;
-					entry->canceled[forknum] = true;
 				}
 			}
 		}
@@ -1646,7 +1731,8 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		{
 			entry->cycle_ctr = GetCheckpointSyncCycle();
 			MemSet(entry->requests, 0, sizeof(entry->requests));
-			MemSet(entry->canceled, 0, sizeof(entry->canceled));
+			MemSet(entry->syncfds, 0, sizeof(entry->syncfds));
+			MemSet(entry->syncfd_len, 0, sizeof(entry->syncfd_len));
 		}
 
 		/*
@@ -1658,6 +1744,55 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		entry->requests[forknum] = bms_add_member(entry->requests[forknum],
 												  (int) segno);
 
+		if (fd >= 0)
+		{
+			/* make space for entry */
+			if (entry->syncfds[forknum] == NULL)
+			{
+				int i;
+
+				entry->syncfds[forknum] = palloc(sizeof(int) * (segno + 1));
+				entry->syncfd_len[forknum] = segno + 1;
+
+				for (i = 0; i <= segno; i++)
+					entry->syncfds[forknum][i] = -1;
+			}
+			else if (entry->syncfd_len[forknum] <= segno)
+			{
+				int i;
+
+				entry->syncfds[forknum] = repalloc(entry->syncfds[forknum],
+												   sizeof(int) * (segno + 1));
+
+				/* initialize newly created entries */
+				for (i = entry->syncfd_len[forknum]; i <= segno; i++)
+					entry->syncfds[forknum][i] = -1;
+
+				entry->syncfd_len[forknum] = segno + 1;
+			}
+
+			if (entry->syncfds[forknum][segno] == -1)
+			{
+				open_fsync_queue_files++;
+				/* caller must have reserved entry */
+				RegisterTransientFile(fd);
+				entry->syncfds[forknum][segno] = fd;
+			}
+			else
+			{
+				/*
+				 * File is already open. Have to keep the older fd, errors
+				 * might only be reported to it, thus close the one we just
+				 * got.
+				 *
+				 * XXX: check for errors.
+				 */
+				close(fd);
+			}
+
+			FlushFsyncRequestQueueIfNecessary();
+		}
+
 		MemoryContextSwitchTo(oldcxt);
 	}
 }
@@ -1674,22 +1809,12 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC, -1);
 	}
 	else if (IsUnderPostmaster)
 	{
-		/*
-		 * Notify the checkpointer about it.  If we fail to queue the cancel
-		 * message, we have to sleep and try again ... ugly, but hopefully
-		 * won't happen often.
-		 *
-		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
-		 * error would leave the no-longer-used file still present on disk,
-		 * which would be bad, so I'm inclined to assume that the checkpointer
-		 * will always empty the queue soon.
-		 */
-		while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
-			pg_usleep(10000L);	/* 10 msec seems a good number */
+		/* Notify the checkpointer about it. */
+		ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC, -1);
 
 		/*
 		 * Note we don't wait for the checkpointer to actually absorb the
@@ -1713,14 +1838,12 @@ ForgetDatabaseFsyncRequests(Oid dbid)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
+		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC, -1);
 	}
 	else if (IsUnderPostmaster)
 	{
 		/* see notes in ForgetRelationFsyncRequests */
-		while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
-									FORGET_DATABASE_FSYNC))
-			pg_usleep(10000L);	/* 10 msec seems a good number */
+		ForwardFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC, -1);
 	}
 }
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 87a5cfad415..58ba671a907 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
 #define _BGWRITER_H
 
 #include "storage/block.h"
+#include "storage/fd.h"
 #include "storage/relfilenode.h"
 
 
@@ -31,9 +32,10 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					BlockNumber segno);
+extern void ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
+								BlockNumber segno, File file);
 extern void AbsorbFsyncRequests(void);
+extern void AbsorbAllFsyncRequests(void);
 
 extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
@@ -43,4 +45,6 @@ extern uint32 IncCheckpointSyncCycle(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void CountBackendWrite(void);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 1877eef2391..e2ba64e8984 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -44,6 +44,11 @@ extern int	postmaster_alive_fds[2];
 #define POSTMASTER_FD_OWN		1	/* kept open by postmaster only */
 #endif
 
+#define FSYNC_FD_SUBMIT			0
+#define FSYNC_FD_PROCESS		1
+
+extern int	fsync_fds[2];
+
 extern PGDLLIMPORT const char *progname;
 
 extern void PostmasterMain(int argc, char *argv[]) pg_attribute_noreturn();
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 558e4d8518b..798a9652927 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -140,7 +140,8 @@ extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);
 extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					 BlockNumber segno);
+					 BlockNumber segno, int fd);
+extern bool FlushFsyncRequestQueueIfNecessary(void);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
 extern void ForgetDatabaseFsyncRequests(Oid dbid);
 
-- 
2.17.0.rc1.dirty

#36Stephen Frost
sfrost@snowman.net
In reply to: Ashutosh Bapat (#34)
Re: Postgres, fsync, and OSs (specifically linux)

Greetings,

* Ashutosh Bapat (ashutosh.bapat@enterprisedb.com) wrote:

On Thu, May 17, 2018 at 11:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, May 17, 2018 at 12:44 PM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:

while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
{
if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
- ereport(ERROR,
+ ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m", src->path)));

To me this (and the other callers) doesn't quite look right. First, I
think we should probably be a bit more restrictive about when PANIC
out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
others. Secondly, I think we should centralize the error handling. It
seems likely that we'll acrue some platform specific workarounds, and I
don't want to copy that knowledge everywhere.

Maybe something like:

ereport(promote_eio_to_panic(ERROR), ...)

Well, searching for places where an error is reported at level PANIC by
grepping for the word PANIC would miss these instances. People will have
to remember to search for promote_eio_to_panic instead. Maybe handle the
errors inside FileSync() itself, or in a wrapper around it.

No, that search wouldn't miss those instances; such a search would find
promote_eio_to_panic() and then someone would go look up the uses of
that function. That hardly seems like a serious issue for folks hacking
on PG.

I'm not saying that having a wrapper around FileSync() would be bad or
having it handle things, but I don't agree with the general notion that
we can't have a function which handles the complicated bits about the
kind of error because someone grep'ing the source for PANIC might have
to do an additional lookup.

Thanks!

Stephen

#37Abhijit Menon-Sen
ams@2ndQuadrant.com
In reply to: Stephen Frost (#36)
Re: Postgres, fsync, and OSs (specifically linux)

At 2018-05-18 20:27:57 -0400, sfrost@snowman.net wrote:

I don't agree with the general notion that we can't have a function
which handles the complicated bits about the kind of error because
someone grep'ing the source for PANIC might have to do an additional
lookup.

Or we could just name the function promote_eio_to_PANIC.

(I understood the objection to be about how 'grep PANIC' wouldn't find
these lines at all, not that there would be an additional lookup.)

-- Abhijit

#38Stephen Frost
sfrost@snowman.net
In reply to: Abhijit Menon-Sen (#37)
Re: Postgres, fsync, and OSs (specifically linux)

Greetings,

* Abhijit Menon-Sen (ams@2ndQuadrant.com) wrote:

At 2018-05-18 20:27:57 -0400, sfrost@snowman.net wrote:

I don't agree with the general notion that we can't have a function
which handles the complicated bits about the kind of error because
someone grep'ing the source for PANIC might have to do an additional
lookup.

Or we could just name the function promote_eio_to_PANIC.

Ugh, I'm not thrilled with that either.

(I understood the objection to be about how 'grep PANIC' wouldn't find
these lines at all, not that there would be an additional lookup.)

... and my point was that 'grep PANIC' would, almost certainly, find the
function promote_eio_to_panic(), and someone could trivially look up all
the callers of that function then.

Thanks!

Stephen

#39Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#35)
Re: Postgres, fsync, and OSs (specifically linux)

On Sat, May 19, 2018 at 9:03 AM, Andres Freund <andres@anarazel.de> wrote:

I've written a patch series for this. Took me quite a bit longer than I
had hoped.

Great.

I plan to switch to working on something else for a day or two next
week, and then polish this further. I'd greatly appreciate comments till
then.

Took it for a spin on macOS and FreeBSD. First problem:

+ if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0)

SOCK_CLOEXEC isn't portable (FreeBSD yes since 10, macOS no, others
who knows). Adding FD_CLOEXEC to your later fcntl() calls is probably
the way to do it? I understand from reading the Linux man pages that
there are race conditions with threads but that doesn't apply here.

Next, make check hangs in initdb on both of my pet OSes when md.c
raises an error (fseek fails): we raise an error while raising an
error, and deadlock against ourselves. Backtrace here:
https://paste.debian.net/1025336/

Apparently the initial error was that mdextend() called _mdnblocks()
which called FileSeek() on vfd 43 == fd 30, pathname "base/1/2704",
but when I check my operating system open file descriptor table I find
that there is no fd 30: there is a 29 and a 31, so it has already been
unexpectedly closed.

I could dig further and/or provide a shell on a system with dev tools.

I didn't want to do this now, but I think we should also consider
removing all awareness of segments from the fsync request queue. Instead
it should deal with individual files, and the segmentation should be
handled by md.c. That'll allow us to move all the necessary code to
smgr.c (or checkpointer?); Thomas said that'd be helpful for further
work. I personally think it'd be a lot simpler, because having to have
long bitmaps with only the last bit set for large append only relations
isn't a particularly sensible approach imo. The only thing that that'd
make more complicated is that the file/database unlink requests get more
expensive (as they'd likely need to search the whole table), but that
seems like a sensible tradeoff. A tree structure would be an
alternative, obviously. Personally I was thinking that we
should just make the hashtable be over a pathname, that seems most
generic.

+1

I'll be posting a patch shortly that also needs similar machinery, but
can't easily share with md.c due to technical details. I'd love there
to be just one of those, and for it to be simpler and general.

--
Thomas Munro
http://www.enterprisedb.com

#40Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#39)
Re: Postgres, fsync, and OSs (specifically linux)

On Sat, May 19, 2018 at 4:51 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Next, make check hangs in initdb on both of my pet OSes when md.c
raises an error (fseek fails): we raise an error while raising an
error, and deadlock against ourselves. Backtrace here:
https://paste.debian.net/1025336/

Ah, I see now that something similar is happening on Linux too, so I
guess you already knew this.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/380913223

--
Thomas Munro
http://www.enterprisedb.com

#41Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Stephen Frost (#38)
Re: Postgres, fsync, and OSs (specifically linux)

On Sat, May 19, 2018 at 6:31 AM, Stephen Frost <sfrost@snowman.net> wrote:

Greetings,

* Abhijit Menon-Sen (ams@2ndQuadrant.com) wrote:

At 2018-05-18 20:27:57 -0400, sfrost@snowman.net wrote:

I don't agree with the general notion that we can't have a function
which handles the complicated bits about the kind of error because
someone grep'ing the source for PANIC might have to do an additional
lookup.

Or we could just name the function promote_eio_to_PANIC.

Ugh, I'm not thrilled with that either.

(I understood the objection to be about how 'grep PANIC' wouldn't find
these lines at all, not that there would be an additional lookup.)

... and my point was that 'grep PANIC' would, almost certainly, find the
function promote_eio_to_panic(), and someone could trivially look up all
the callers of that function then.

It's not just grep, but tools like cscope and ctags. Although I agree
that adding a function, if at all necessary, is more important than the
convenience of finding all instances of a certain token easily.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#42Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#32)
Re: Postgres, fsync, and OSs (specifically linux)

On 18 May 2018 at 00:44, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:

while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
{
if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
-                     ereport(ERROR,
+                     ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m", src->path)));

To me this (and the other callers) doesn't quite look right. First, I
think we should probably be a bit more restrictive about when we PANIC
out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
others. Secondly, I think we should centralize the error handling. It
seems likely that we'll accrue some platform-specific workarounds, and I
don't want to copy that knowledge everywhere.

Also, don't we need the same on close()?

Yes, we do, and that expands the scope a bit.

I agree with Robert that some sort of filter/macro is wise, though naming
it clearly will be tricky.

I'll have a look.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#43Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#42)
Re: Postgres, fsync, and OSs (specifically linux)

On 21 May 2018 at 12:57, Craig Ringer <craig@2ndquadrant.com> wrote:

On 18 May 2018 at 00:44, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:

while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
{
if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
-                     ereport(ERROR,
+                     ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m", src->path)));

To me this (and the other callers) doesn't quite look right. First, I
think we should probably be a bit more restrictive about when we PANIC
out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
others. Secondly, I think we should centralize the error handling. It
seems likely that we'll accrue some platform-specific workarounds, and I
don't want to copy that knowledge everywhere.

Also, don't we need the same on close()?

Yes, we do, and that expands the scope a bit.

I agree with Robert that some sort of filter/macro is wise, though naming
it clearly will be tricky.

I'll have a look.

On the queue for tomorrow.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#44Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#40)
6 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

On 2018-05-19 18:12:52 +1200, Thomas Munro wrote:

On Sat, May 19, 2018 at 4:51 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Next, make check hangs in initdb on both of my pet OSes when md.c
raises an error (fseek fails) and we raise and error while raising and
error and deadlock against ourselves. Backtrace here:
https://paste.debian.net/1025336/

Ah, I see now that something similar is happening on Linux too, so I
guess you already knew this.

I didn't. I cleaned something up and only tested installcheck
after... Single-user mode was broken.

Attached is a new version.

I've changed my previous attempt at using transient files to using File
type files, but unlinked from the LRU so that they're kept open. Not sure
if that's perfect, but seems cleaner.

Greetings,

Andres Freund

Attachments:

v2-0001-freespace-Don-t-constantly-close-files-when-readi.patchtext/x-diff; charset=us-asciiDownload
From 96435b05c9546b6da829043fb10b2a7309216bd2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 21 May 2018 15:43:30 -0700
Subject: [PATCH v2 1/6] freespace: Don't constantly close files when reading
 buffer.

fsm_readbuf() used to always do an smgrexists() when reading a buffer
beyond the known file size. That currently implies closing the md.c
handle, losing all the data cached therein.  Change this to only
check for file existence when not already known to be larger than 0
blocks.

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/freespace/freespace.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 65c4e74999f..d7569cec5ed 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -556,7 +556,7 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
 	 * not on extension.)
 	 */
 	if (rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber ||
-		blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+		rel->rd_smgr->smgr_fsm_nblocks == 0)
 	{
 		if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
 			rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
@@ -564,6 +564,9 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
 		else
 			rel->rd_smgr->smgr_fsm_nblocks = 0;
 	}
+	else if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+		rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
+													 FSM_FORKNUM);
 
 	/* Handle requests beyond EOF */
 	if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
-- 
2.17.0.rc1.dirty

v2-0002-Add-functions-to-send-receive-data-FD-over-a-unix.patchtext/x-diff; charset=us-asciiDownload
From aa533828e6164731006dab92665fa92b7b058d6f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 21 May 2018 15:43:30 -0700
Subject: [PATCH v2 2/6] Add functions to send/receive data & FD over a unix
 domain socket.

This'll be used by a followup patch changing how the fsync request
queue works, to make it safe on linux.

TODO: This probably should live elsewhere.

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/file/fd.c | 102 ++++++++++++++++++++++++++++++++++
 src/include/storage/fd.h      |   4 ++
 2 files changed, 106 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 441f18dcf56..65e46483a44 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3572,3 +3572,105 @@ MakePGDirectory(const char *directoryName)
 {
 	return mkdir(directoryName, pg_dir_create_mode);
 }
+
+/*
+ * Send data over a unix domain socket, optionally (when fd != -1) including a
+ * file descriptor.
+ */
+ssize_t
+pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd)
+{
+	ssize_t     size;
+	struct msghdr   msg = {0};
+	struct iovec    iov;
+	/* cmsg header, union for correct alignment */
+	union
+	{
+		struct cmsghdr  cmsghdr;
+		char        control[CMSG_SPACE(sizeof (int))];
+	} cmsgu;
+	struct cmsghdr  *cmsg;
+
+	iov.iov_base = buf;
+	iov.iov_len = buflen;
+
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
+	if (fd >= 0)
+	{
+		msg.msg_control = cmsgu.control;
+		msg.msg_controllen = sizeof(cmsgu.control);
+
+		cmsg = CMSG_FIRSTHDR(&msg);
+		cmsg->cmsg_len = CMSG_LEN(sizeof (int));
+		cmsg->cmsg_level = SOL_SOCKET;
+		cmsg->cmsg_type = SCM_RIGHTS;
+
+		*((int *) CMSG_DATA(cmsg)) = fd;
+	}
+
+	size = sendmsg(sock, &msg, 0);
+
+	/* errors are returned directly */
+	return size;
+}
+
+/*
+ * Receive data from a unix domain socket. If a file is sent over the socket,
+ * store it in *fd.
+ */
+ssize_t
+pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd)
+{
+	ssize_t     size;
+	struct msghdr   msg;
+	struct iovec    iov;
+	/* cmsg header, union for correct alignment */
+	union
+	{
+		struct cmsghdr  cmsghdr;
+		char        control[CMSG_SPACE(sizeof (int))];
+	} cmsgu;
+	struct cmsghdr  *cmsg;
+
+	Assert(fd != NULL);
+
+	iov.iov_base = buf;
+	iov.iov_len = bufsize;
+
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = cmsgu.control;
+	msg.msg_controllen = sizeof(cmsgu.control);
+
+	size = recvmsg (sock, &msg, 0);
+
+	if (size < 0)
+	{
+		*fd = -1;
+		return size;
+	}
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	if (cmsg && cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
+	{
+		if (cmsg->cmsg_level != SOL_SOCKET)
+			elog(FATAL, "unexpected cmsg_level");
+
+		if (cmsg->cmsg_type != SCM_RIGHTS)
+			elog(FATAL, "unexpected cmsg_type");
+
+		*fd = *((int *) CMSG_DATA(cmsg));
+
+		/* FIXME: check / handle additional cmsg structures */
+	}
+	else
+		*fd = -1;
+
+	return size;
+}
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..5e016d69a5a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -143,4 +143,8 @@ extern void SyncDataDirectory(void);
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
 #define PG_TEMP_FILE_PREFIX "pgsql_tmp"
 
+/* XXX: This should probably go elsewhere */
+ssize_t pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd);
+ssize_t pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd);
+
 #endif							/* FD_H */
-- 
2.17.0.rc1.dirty

v2-0003-Make-FileGetRawDesc-ensure-there-s-an-associated-.patchtext/x-diff; charset=us-asciiDownload
From 87a6fa9b8d478504b0a4323a3e104b353a239f1a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 21 May 2018 15:43:30 -0700
Subject: [PATCH v2 3/6] Make FileGetRawDesc() ensure there's an associated
 kernel FD.

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/file/fd.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 65e46483a44..8ae13a51ec1 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2232,6 +2232,10 @@ int
 FileGetRawDesc(File file)
 {
 	Assert(FileIsValid(file));
+
+	if (FileAccess(file))
+		return -1;
+
 	return VfdCache[file].fd;
 }
 
-- 
2.17.0.rc1.dirty

v2-0004-WIP-Add-FileOpenForFd.patchtext/x-diff; charset=us-asciiDownload
From 4cdfdef2ff441f5db01dc989a90aecce4dc6b272 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 21 May 2018 15:43:30 -0700
Subject: [PATCH v2 4/6] WIP: Add FileOpenForFd().

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/file/fd.c | 68 +++++++++++++++++++++++++++++++----
 src/include/storage/fd.h      |  2 ++
 2 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8ae13a51ec1..50a1cb930f6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -180,6 +180,7 @@ int			max_safe_fds = 32;	/* default if not changed */
 #define FD_DELETE_AT_CLOSE	(1 << 0)	/* T = delete when closed */
 #define FD_CLOSE_AT_EOXACT	(1 << 1)	/* T = close at eoXact */
 #define FD_TEMP_FILE_LIMIT	(1 << 2)	/* T = respect temp_file_limit */
+#define FD_NOT_IN_LRU		(1 << 3)	/* T = not in LRU */
 
 typedef struct vfd
 {
@@ -304,7 +305,6 @@ static void LruDelete(File file);
 static void Insert(File file);
 static int	LruInsert(File file);
 static bool ReleaseLruFile(void);
-static void ReleaseLruFiles(void);
 static File AllocateVfd(void);
 static void FreeVfd(File file);
 
@@ -1176,7 +1176,7 @@ ReleaseLruFile(void)
  * Release kernel FDs as needed to get under the max_safe_fds limit.
  * After calling this, it's OK to try to open another file.
  */
-static void
+void
 ReleaseLruFiles(void)
 {
 	while (nfile + numAllocatedDescs >= max_safe_fds)
@@ -1289,9 +1289,11 @@ FileAccess(File file)
 		 * We now know that the file is open and that it is not the last one
 		 * accessed, so we need to move it to the head of the Lru ring.
 		 */
-
-		Delete(file);
-		Insert(file);
+		if (!(VfdCache[file].fdstate & FD_NOT_IN_LRU))
+		{
+			Delete(file);
+			Insert(file);
+		}
 	}
 
 	return 0;
@@ -1414,6 +1416,56 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 	return file;
 }
 
+/*
+ * Open a File for a pre-existing file descriptor.
+ *
+ * Note that these files will not be closed on an LRU basis; the caller is
+ * therefore responsible for limiting the number of open file descriptors.
+ *
+ * The passed-in name is purely for informational purposes.
+ */
+File
+FileOpenForFd(int fd, const char *fileName)
+{
+	char	   *fnamecopy;
+	File		file;
+	Vfd		   *vfdP;
+
+	/*
+	 * We need a malloc'd copy of the file name; fail cleanly if no room.
+	 */
+	fnamecopy = strdup(fileName);
+	if (fnamecopy == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory")));
+
+	file = AllocateVfd();
+	vfdP = &VfdCache[file];
+
+	/* Close excess kernel FDs. */
+	ReleaseLruFiles();
+
+	vfdP->fd = fd;
+	++nfile;
+
+	DO_DB(elog(LOG, "FileOpenForFd: success %d/%d (%s)",
+			   file, fd, fnamecopy));
+
+	/* NB: Explicitly not inserted into LRU! */
+
+	vfdP->fileName = fnamecopy;
+	/* Saved flags are adjusted to be OK for re-opening file */
+	vfdP->fileFlags = 0;
+	vfdP->fileMode = 0;
+	vfdP->seekPos = 0;
+	vfdP->fileSize = 0;
+	vfdP->fdstate = FD_NOT_IN_LRU;
+	vfdP->resowner = NULL;
+
+	return file;
+}
+
 /*
  * Create directory 'directory'.  If necessary, create 'basedir', which must
  * be the directory above it.  This is designed for creating the top-level
@@ -1760,7 +1812,11 @@ FileClose(File file)
 		vfdP->fd = VFD_CLOSED;
 
 		/* remove the file from the lru ring */
-		Delete(file);
+		if (!(vfdP->fdstate & FD_NOT_IN_LRU))
+		{
+			vfdP->fdstate &= ~FD_NOT_IN_LRU;
+			Delete(file);
+		}
 	}
 
 	if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 5e016d69a5a..e96e8b13982 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -65,6 +65,7 @@ extern int	max_safe_fds;
 /* Operations on virtual Files --- equivalent to Unix kernel file ops */
 extern File PathNameOpenFile(const char *fileName, int fileFlags);
 extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
+extern File FileOpenForFd(int fd, const char *fileName);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
@@ -127,6 +128,7 @@ extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
 				  SubTransactionId parentSubid);
 extern void RemovePgTempFiles(void);
 extern bool looks_like_temp_rel_name(const char *name);
+extern void ReleaseLruFiles(void);
 
 extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
-- 
2.17.0.rc1.dirty

Attachment: v2-0005-WIP-Optimize-register_dirty_segment-to-not-repeat.patch (text/x-diff)
From bd7dcdafa752802566d0fef3ac9c126c41f276e4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 21 May 2018 15:43:30 -0700
Subject: [PATCH v2 5/6] WIP: Optimize register_dirty_segment() to not
 repeatedly queue fsync requests.

Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/checkpointer.c | 36 ++++++++++++-------
 src/backend/storage/smgr/md.c         | 50 +++++++++++++++++++--------
 src/include/postmaster/bgwriter.h     |  3 ++
 3 files changed, 63 insertions(+), 26 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0950ada6019..333eb91c9de 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,6 +46,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "port/atomics.h"
 #include "postmaster/bgwriter.h"
 #include "replication/syncrep.h"
 #include "storage/bufmgr.h"
@@ -126,8 +127,9 @@ typedef struct
 
 	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
 
-	uint32		num_backend_writes; /* counts user backend buffer writes */
-	uint32		num_backend_fsync;	/* counts user backend fsync calls */
+	pg_atomic_uint32 num_backend_writes; /* counts user backend buffer writes */
+	pg_atomic_uint32 num_backend_fsync;	/* counts user backend fsync calls */
+	pg_atomic_uint32 ckpt_cycle; /* cycle */
 
 	int			num_requests;	/* current # of requests */
 	int			max_requests;	/* allocated array size */
@@ -943,6 +945,9 @@ CheckpointerShmemInit(void)
 		MemSet(CheckpointerShmem, 0, size);
 		SpinLockInit(&CheckpointerShmem->ckpt_lck);
 		CheckpointerShmem->max_requests = NBuffers;
+		pg_atomic_init_u32(&CheckpointerShmem->ckpt_cycle, 0);
+		pg_atomic_init_u32(&CheckpointerShmem->num_backend_writes, 0);
+		pg_atomic_init_u32(&CheckpointerShmem->num_backend_fsync, 0);
 	}
 }
 
@@ -1133,10 +1138,6 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 
 	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
-	/* Count all backend writes regardless of if they fit in the queue */
-	if (!AmBackgroundWriterProcess())
-		CheckpointerShmem->num_backend_writes++;
-
 	/*
 	 * If the checkpointer isn't running or the request queue is full, the
 	 * backend will have to perform its own fsync request.  But before forcing
@@ -1151,7 +1152,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		 * fsync
 		 */
 		if (!AmBackgroundWriterProcess())
-			CheckpointerShmem->num_backend_fsync++;
+			pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_fsync, 1);
 		LWLockRelease(CheckpointerCommLock);
 		return false;
 	}
@@ -1312,11 +1313,10 @@ AbsorbFsyncRequests(void)
 	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
 	/* Transfer stats counts into pending pgstats message */
-	BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-	BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
-
-	CheckpointerShmem->num_backend_writes = 0;
-	CheckpointerShmem->num_backend_fsync = 0;
+	BgWriterStats.m_buf_written_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+	BgWriterStats.m_buf_fsync_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
 
 	/*
 	 * We try to avoid holding the lock for a long time by copying the request
@@ -1390,3 +1390,15 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+uint32
+GetCheckpointSyncCycle(void)
+{
+	return pg_atomic_read_u32(&CheckpointerShmem->ckpt_cycle);
+}
+
+uint32
+IncCheckpointSyncCycle(void)
+{
+	return pg_atomic_fetch_add_u32(&CheckpointerShmem->ckpt_cycle, 1);
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ec103e6047..555774320b5 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -109,6 +109,7 @@ typedef struct _MdfdVec
 {
 	File		mdfd_vfd;		/* fd number in fd.c's pool */
 	BlockNumber mdfd_segno;		/* segment number, from 0 */
+	uint32		mdfd_dirtied_cycle;
 } MdfdVec;
 
 static MemoryContext MdCxt;		/* context for all MdfdVec objects */
@@ -133,12 +134,12 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
  * (Regular backends do not track pending operations locally, but forward
  * them to the checkpointer.)
  */
-typedef uint16 CycleCtr;		/* can be any convenient integer size */
+typedef uint32 CycleCtr;		/* can be any convenient integer size */
 
 typedef struct
 {
 	RelFileNode rnode;			/* hash table key (must be first!) */
-	CycleCtr	cycle_ctr;		/* mdsync_cycle_ctr of oldest request */
+	CycleCtr	cycle_ctr;		/* sync cycle of oldest request */
 	/* requests[f] has bit n set if we need to fsync segment n of fork f */
 	Bitmapset  *requests[MAX_FORKNUM + 1];
 	/* canceled[f] is true if we canceled fsyncs for fork "recently" */
@@ -155,7 +156,6 @@ static HTAB *pendingOpsTable = NULL;
 static List *pendingUnlinks = NIL;
 static MemoryContext pendingOpsCxt; /* context for the above  */
 
-static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
 
@@ -333,6 +333,7 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
 	mdfd = &reln->md_seg_fds[forkNum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
+	mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
 }
 
 /*
@@ -614,6 +615,7 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
 	mdfd = &reln->md_seg_fds[forknum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
+	mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
 
 	Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
 
@@ -1089,9 +1091,9 @@ mdsync(void)
 	 * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
 	 * checkpoint), we want to ignore fsync requests that are entered into the
 	 * hashtable after this point --- they should be processed next time,
-	 * instead.  We use mdsync_cycle_ctr to tell old entries apart from new
-	 * ones: new ones will have cycle_ctr equal to the incremented value of
-	 * mdsync_cycle_ctr.
+	 * instead.  We use GetCheckpointSyncCycle() to tell old entries apart
+	 * from new ones: new ones will have cycle_ctr equal to
+	 * IncCheckpointSyncCycle().
 	 *
 	 * In normal circumstances, all entries present in the table at this point
 	 * will have cycle_ctr exactly equal to the current (about to be old)
@@ -1115,16 +1117,16 @@ mdsync(void)
 		hash_seq_init(&hstat, pendingOpsTable);
 		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 		{
-			entry->cycle_ctr = mdsync_cycle_ctr;
+			entry->cycle_ctr = GetCheckpointSyncCycle();
 		}
 	}
 
-	/* Advance counter so that new hashtable entries are distinguishable */
-	mdsync_cycle_ctr++;
-
 	/* Set flag to detect failure if we don't reach the end of the loop */
 	mdsync_in_progress = true;
 
+	/* Advance counter so that new hashtable entries are distinguishable */
+	IncCheckpointSyncCycle();
+
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
@@ -1137,11 +1139,11 @@ mdsync(void)
 		 * contain multiple fsync-request bits, but they are all new.  Note
 		 * "continue" bypasses the hash-remove call at the bottom of the loop.
 		 */
-		if (entry->cycle_ctr == mdsync_cycle_ctr)
+		if (entry->cycle_ctr == GetCheckpointSyncCycle())
 			continue;
 
 		/* Else assert we haven't missed it */
-		Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
+		Assert((CycleCtr) (entry->cycle_ctr + 1) == GetCheckpointSyncCycle());
 
 		/*
 		 * Scan over the forks and segments represented by the entry.
@@ -1308,7 +1310,7 @@ mdsync(void)
 				break;
 		}
 		if (forknum <= MAX_FORKNUM)
-			entry->cycle_ctr = mdsync_cycle_ctr;
+			entry->cycle_ctr = GetCheckpointSyncCycle();
 		else
 		{
 			/* Okay to remove it */
@@ -1427,18 +1429,37 @@ mdpostckpt(void)
 static void
 register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
+	uint32 cycle;
+
 	/* Temp relations should never be fsync'd */
 	Assert(!SmgrIsTemp(reln));
 
+	pg_memory_barrier();
+	cycle = GetCheckpointSyncCycle();
+
+	/*
+	 * Don't repeatedly register the same segment as dirty.
+	 *
+	 * FIXME: This doesn't correctly deal with overflows yet! We could
+	 * e.g. emit an smgr invalidation every now and then, or use a 64bit
+	 * counter.  Or just error out if the cycle reaches UINT32_MAX.
+	 */
+	if (seg->mdfd_dirtied_cycle == cycle)
+		return;
+
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
 		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+		seg->mdfd_dirtied_cycle = cycle;
 	}
 	else
 	{
 		if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
+		{
+			seg->mdfd_dirtied_cycle = cycle;
 			return;				/* passed it off successfully */
+		}
 
 		ereport(DEBUG1,
 				(errmsg("could not forward fsync request because request queue is full")));
@@ -1623,7 +1644,7 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		/* if new entry, initialize it */
 		if (!found)
 		{
-			entry->cycle_ctr = mdsync_cycle_ctr;
+			entry->cycle_ctr = GetCheckpointSyncCycle();
 			MemSet(entry->requests, 0, sizeof(entry->requests));
 			MemSet(entry->canceled, 0, sizeof(entry->canceled));
 		}
@@ -1793,6 +1814,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	v = &reln->md_seg_fds[forknum][segno];
 	v->mdfd_vfd = fd;
 	v->mdfd_segno = segno;
+	v->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
 
 	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d1..87a5cfad415 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -38,6 +38,9 @@ extern void AbsorbFsyncRequests(void);
 extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
 
+extern uint32 GetCheckpointSyncCycle(void);
+extern uint32 IncCheckpointSyncCycle(void);
+
 extern bool FirstCallSinceLastCheckpoint(void);
 
 #endif							/* _BGWRITER_H */
-- 
2.17.0.rc1.dirty

Attachment: v2-0006-Heavily-WIP-Send-file-descriptors-to-checkpointer.patch (text/x-diff)
From 306975ba3856c5be31eca5d8efe6d7fe9eee06c6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 21 May 2018 15:43:30 -0700
Subject: [PATCH v2 6/6] Heavily-WIP: Send file descriptors to checkpointer for
 fsyncing.

This addresses the issue that, at least on linux, fsync() only reliably
reports errors that occurred after the file descriptor was opened.

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/transam/xlog.c     |   7 +-
 src/backend/postmaster/checkpointer.c | 354 +++++++----------
 src/backend/postmaster/postmaster.c   |  38 ++
 src/backend/storage/smgr/md.c         | 549 ++++++++++++++++----------
 src/include/postmaster/bgwriter.h     |   8 +-
 src/include/postmaster/postmaster.h   |   5 +
 src/include/storage/smgr.h            |   3 +-
 7 files changed, 543 insertions(+), 421 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adbd6a21264..427774152eb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8634,8 +8634,10 @@ CreateCheckPoint(int flags)
 	 * Note: because it is possible for log_checkpoints to change while a
 	 * checkpoint proceeds, we always accumulate stats, even if
 	 * log_checkpoints is currently off.
+	 *
+	 * Note #2: this is reset at the end of the checkpoint, not here, because
+	 * we might have to fsync before getting here (see mdsync()).
 	 */
-	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
 	/*
@@ -8999,6 +9001,9 @@ CreateCheckPoint(int flags)
 									 CheckpointStats.ckpt_segs_recycled);
 
 	LWLockRelease(CheckpointLock);
+
+	/* reset stats */
+	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 }
 
 /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 333eb91c9de..c2be529bca4 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -48,6 +48,7 @@
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/postmaster.h"
 #include "replication/syncrep.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
@@ -102,19 +103,21 @@
  *
  * The requests array holds fsync requests sent by backends and not yet
  * absorbed by the checkpointer.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
  *----------
  */
 typedef struct
 {
+	uint32		type;
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber segno;			/* see md.c for special values */
+	bool		contains_fd;
 	/* might add a real request-type field later; not needed yet */
 } CheckpointerRequest;
 
+#define CKPT_REQUEST_RNODE			1
+#define CKPT_REQUEST_SYN			2
+
 typedef struct
 {
 	pid_t		checkpointer_pid;	/* PID (0 if not started) */
@@ -131,8 +134,6 @@ typedef struct
 	pg_atomic_uint32 num_backend_fsync;	/* counts user backend fsync calls */
 	pg_atomic_uint32 ckpt_cycle; /* cycle */
 
-	int			num_requests;	/* current # of requests */
-	int			max_requests;	/* allocated array size */
 	CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
 } CheckpointerShmemStruct;
 
@@ -168,13 +169,17 @@ static double ckpt_cached_elapsed;
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
 
+static BlockNumber next_syn_rqst;
+static BlockNumber received_syn_rqst;
+
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
 static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
-static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void SendFsyncRequest(CheckpointerRequest *request, int fd);
+static bool AbsorbFsyncRequest(void);
 
 /* Signal handlers */
 
@@ -557,10 +562,11 @@ CheckpointerMain(void)
 			cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
 		}
 
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-					   cur_timeout * 1000L /* convert to ms */ ,
-					   WAIT_EVENT_CHECKPOINTER_MAIN);
+		rc = WaitLatchOrSocket(MyLatch,
+							   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
+							   fsync_fds[FSYNC_FD_PROCESS],
+							   cur_timeout * 1000L /* convert to ms */ ,
+							   WAIT_EVENT_CHECKPOINTER_MAIN);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
@@ -910,12 +916,7 @@ CheckpointerShmemSize(void)
 {
 	Size		size;
 
-	/*
-	 * Currently, the size of the requests[] array is arbitrarily set equal to
-	 * NBuffers.  This may prove too large or small ...
-	 */
 	size = offsetof(CheckpointerShmemStruct, requests);
-	size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
 
 	return size;
 }
@@ -938,13 +939,10 @@ CheckpointerShmemInit(void)
 	if (!found)
 	{
 		/*
-		 * First time through, so initialize.  Note that we zero the whole
-		 * requests array; this is so that CompactCheckpointerRequestQueue can
-		 * assume that any pad bytes in the request structs are zeroes.
+		 * First time through, so initialize.
 		 */
 		MemSet(CheckpointerShmem, 0, size);
 		SpinLockInit(&CheckpointerShmem->ckpt_lck);
-		CheckpointerShmem->max_requests = NBuffers;
 		pg_atomic_init_u32(&CheckpointerShmem->ckpt_cycle, 0);
 		pg_atomic_init_u32(&CheckpointerShmem->num_backend_writes, 0);
 		pg_atomic_init_u32(&CheckpointerShmem->num_backend_fsync, 0);
@@ -1124,176 +1122,61 @@ RequestCheckpoint(int flags)
  * the queue is full and contains no duplicate entries.  In that case, we
  * let the backend know by returning false.
  */
-bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+void
+ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno,
+					File file)
 {
-	CheckpointerRequest *request;
-	bool		too_full;
+	CheckpointerRequest request = {0};
 
 	if (!IsUnderPostmaster)
-		return false;			/* probably shouldn't even get here */
+		elog(ERROR, "ForwardFsyncRequest must not be called in single user mode");
 
 	if (AmCheckpointerProcess())
 		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
 
-	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+	request.type = CKPT_REQUEST_RNODE;
+	request.rnode = rnode;
+	request.forknum = forknum;
+	request.segno = segno;
+	request.contains_fd = file != -1;
 
-	/*
-	 * If the checkpointer isn't running or the request queue is full, the
-	 * backend will have to perform its own fsync request.  But before forcing
-	 * that to happen, we can try to compact the request queue.
-	 */
-	if (CheckpointerShmem->checkpointer_pid == 0 ||
-		(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
-		 !CompactCheckpointerRequestQueue()))
-	{
-		/*
-		 * Count the subset of writes where backends have to do their own
-		 * fsync
-		 */
-		if (!AmBackgroundWriterProcess())
-			pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_fsync, 1);
-		LWLockRelease(CheckpointerCommLock);
-		return false;
-	}
-
-	/* OK, insert request */
-	request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
-	request->rnode = rnode;
-	request->forknum = forknum;
-	request->segno = segno;
-
-	/* If queue is more than half full, nudge the checkpointer to empty it */
-	too_full = (CheckpointerShmem->num_requests >=
-				CheckpointerShmem->max_requests / 2);
-
-	LWLockRelease(CheckpointerCommLock);
-
-	/* ... but not till after we release the lock */
-	if (too_full && ProcGlobal->checkpointerLatch)
-		SetLatch(ProcGlobal->checkpointerLatch);
-
-	return true;
-}
-
-/*
- * CompactCheckpointerRequestQueue
- *		Remove duplicates from the request queue to avoid backend fsyncs.
- *		Returns "true" if any entries were removed.
- *
- * Although a full fsync request queue is not common, it can lead to severe
- * performance problems when it does happen.  So far, this situation has
- * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint.  Without this
- * logic, each backend begins doing an fsync for every block written, which
- * gets very expensive and can slow down the whole system.
- *
- * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries.  But that should be vanishingly rare in
- * practice: there's one queue entry per shared buffer.
- */
-static bool
-CompactCheckpointerRequestQueue(void)
-{
-	struct CheckpointerSlotMapping
-	{
-		CheckpointerRequest request;
-		int			slot;
-	};
-
-	int			n,
-				preserve_count;
-	int			num_skipped = 0;
-	HASHCTL		ctl;
-	HTAB	   *htab;
-	bool	   *skip_slot;
-
-	/* must hold CheckpointerCommLock in exclusive mode */
-	Assert(LWLockHeldByMe(CheckpointerCommLock));
-
-	/* Initialize skip_slot array */
-	skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
-
-	/* Initialize temporary hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
-	ctl.keysize = sizeof(CheckpointerRequest);
-	ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
-	ctl.hcxt = CurrentMemoryContext;
-
-	htab = hash_create("CompactCheckpointerRequestQueue",
-					   CheckpointerShmem->num_requests,
-					   &ctl,
-					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-	/*
-	 * The basic idea here is that a request can be skipped if it's followed
-	 * by a later, identical request.  It might seem more sensible to work
-	 * backwards from the end of the queue and check whether a request is
-	 * *preceded* by an earlier, identical request, in the hopes of doing less
-	 * copying.  But that might change the semantics, if there's an
-	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
-	 * we do it this way.  It would be possible to be even smarter if we made
-	 * the code below understand the specific semantics of such requests (it
-	 * could blow away preceding entries that would end up being canceled
-	 * anyhow), but it's not clear that the extra complexity would buy us
-	 * anything.
-	 */
-	for (n = 0; n < CheckpointerShmem->num_requests; n++)
-	{
-		CheckpointerRequest *request;
-		struct CheckpointerSlotMapping *slotmap;
-		bool		found;
-
-		/*
-		 * We use the request struct directly as a hashtable key.  This
-		 * assumes that any padding bytes in the structs are consistently the
-		 * same, which should be okay because we zeroed them in
-		 * CheckpointerShmemInit.  Note also that RelFileNode had better
-		 * contain no pad bytes.
-		 */
-		request = &CheckpointerShmem->requests[n];
-		slotmap = hash_search(htab, request, HASH_ENTER, &found);
-		if (found)
-		{
-			/* Duplicate, so mark the previous occurrence as skippable */
-			skip_slot[slotmap->slot] = true;
-			num_skipped++;
-		}
-		/* Remember slot containing latest occurrence of this request value */
-		slotmap->slot = n;
-	}
-
-	/* Done with the hash table. */
-	hash_destroy(htab);
-
-	/* If no duplicates, we're out of luck. */
-	if (!num_skipped)
-	{
-		pfree(skip_slot);
-		return false;
-	}
-
-	/* We found some duplicates; remove them. */
-	preserve_count = 0;
-	for (n = 0; n < CheckpointerShmem->num_requests; n++)
-	{
-		if (skip_slot[n])
-			continue;
-		CheckpointerShmem->requests[preserve_count++] = CheckpointerShmem->requests[n];
-	}
-	ereport(DEBUG1,
-			(errmsg("compacted fsync request queue from %d entries to %d entries",
-					CheckpointerShmem->num_requests, preserve_count)));
-	CheckpointerShmem->num_requests = preserve_count;
-
-	/* Cleanup. */
-	pfree(skip_slot);
-	return true;
+	SendFsyncRequest(&request, request.contains_fd ? FileGetRawDesc(file) : -1);
 }
 
 /*
  * AbsorbFsyncRequests
- *		Retrieve queued fsync requests and pass them to local smgr.
+ *		Retrieve queued fsync requests and pass them to local smgr. Stop when
+ *		resources would be exhausted by absorbing more.
+ *
+ * This is exported because we want to continue accepting requests during
+ * mdsync().
+ */
+void
+AbsorbFsyncRequests(void)
+{
+	if (!AmCheckpointerProcess())
+		return;
+
+	/* Transfer stats counts into pending pgstats message */
+	BgWriterStats.m_buf_written_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+	BgWriterStats.m_buf_fsync_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
+
+	while (true)
+	{
+		if (!FlushFsyncRequestQueueIfNecessary())
+			break;
+
+		if (!AbsorbFsyncRequest())
+			break;
+	}
+}
+
+/*
+ * AbsorbAllFsyncRequests
+ *		Retrieve all already pending fsync requests and pass them to local
+ *		smgr.
  *
  * This is exported because it must be called during CreateCheckPoint;
  * we have to be sure we have accepted all pending requests just before
@@ -1301,17 +1184,13 @@ CompactCheckpointerRequestQueue(void)
  * non-checkpointer processes, do nothing if not checkpointer.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbAllFsyncRequests(void)
 {
-	CheckpointerRequest *requests = NULL;
-	CheckpointerRequest *request;
-	int			n;
+	CheckpointerRequest request = {0};
 
 	if (!AmCheckpointerProcess())
 		return;
 
-	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
 	/* Transfer stats counts into pending pgstats message */
 	BgWriterStats.m_buf_written_backend +=
 		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
@@ -1319,35 +1198,65 @@ AbsorbFsyncRequests(void)
 		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
 
 	/*
-	 * We try to avoid holding the lock for a long time by copying the request
-	 * array, and processing the requests after releasing the lock.
-	 *
-	 * Once we have cleared the requests from shared memory, we have to PANIC
-	 * if we then fail to absorb them (eg, because our hashtable runs out of
-	 * memory).  This is because the system cannot run safely if we are unable
-	 * to fsync what we have been told to fsync.  Fortunately, the hashtable
-	 * is so small that the problem is quite unlikely to arise in practice.
+	 * For mdsync()'s guarantees to work, all pending fsync requests need to
+	 * be executed. But we don't want to absorb requests till the queue is
+	 * empty, as that could take a long while.  So instead we enqueue a SYN marker and absorb requests only until it is read back.
 	 */
-	n = CheckpointerShmem->num_requests;
-	if (n > 0)
+	request.type = CKPT_REQUEST_SYN;
+	request.segno = ++next_syn_rqst;
+	SendFsyncRequest(&request, -1);
+
+	received_syn_rqst = next_syn_rqst + 1;
+	while (received_syn_rqst != request.segno)
 	{
-		requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
-		memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
+		if (!FlushFsyncRequestQueueIfNecessary())
+			elog(FATAL, "should not happen");
+
+		if (!AbsorbFsyncRequest())
+			break;
 	}
+}
+
+/*
+ * AbsorbFsyncRequest
+ *		Retrieve one queued fsync request and pass it to local smgr.
+ */
+static bool
+AbsorbFsyncRequest(void)
+{
+	CheckpointerRequest req;
+	int fd;
+	int ret;
+
+	ReleaseLruFiles();
 
 	START_CRIT_SECTION();
+	ret = pg_uds_recv_with_fd(fsync_fds[FSYNC_FD_PROCESS], &req, sizeof(req), &fd);
+	if (ret < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
+	{
+		END_CRIT_SECTION();
+		return false;
+	}
+	else if (ret < 0)
+		elog(ERROR, "recvmsg failed: %m");
 
-	CheckpointerShmem->num_requests = 0;
-
-	LWLockRelease(CheckpointerCommLock);
-
-	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+	if (req.contains_fd != (fd != -1))
+	{
+		elog(FATAL, "fsync request's contains_fd flag does not match whether an fd was received");
+	}
 
+	if (req.type == CKPT_REQUEST_SYN)
+	{
+		received_syn_rqst = req.segno;
+		Assert(fd == -1);
+	}
+	else
+	{
+		RememberFsyncRequest(req.rnode, req.forknum, req.segno, fd);
+	}
 	END_CRIT_SECTION();
 
-	if (requests)
-		pfree(requests);
+	return true;
 }
 
 /*
@@ -1402,3 +1311,42 @@ IncCheckpointSyncCycle(void)
 {
 	return pg_atomic_fetch_add_u32(&CheckpointerShmem->ckpt_cycle, 1);
 }
+
+void
+CountBackendWrite(void)
+{
+	pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_writes, 1);
+}
+
+static void
+SendFsyncRequest(CheckpointerRequest *request, int fd)
+{
+	ssize_t ret;
+
+	while (true)
+	{
+		ret = pg_uds_send_with_fd(fsync_fds[FSYNC_FD_SUBMIT], request, sizeof(*request),
+								  request->contains_fd ? fd : -1);
+
+		if (ret >= 0)
+		{
+			/*
+			 * Don't think short writes will ever happen in realistic
+			 * implementations, but better make sure that's true...
+			 */
+			if (ret != sizeof(*request))
+				elog(FATAL, "oops, gotta do better");
+			break;
+		}
+		else if (errno == EWOULDBLOCK || errno == EAGAIN)
+		{
+			/* blocked on write - wait for socket to become writeable */
+			/* FIXME: postmaster death? Other interrupts? */
+			WaitLatchOrSocket(NULL, WL_SOCKET_WRITEABLE, fsync_fds[FSYNC_FD_SUBMIT], -1, 0);
+		}
+		else
+		{
+			ereport(FATAL, (errmsg("could not send fsync request: %m")));
+		}
+	}
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b53b33cdd..135aa29bfeb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -70,6 +70,7 @@
 #include <time.h>
 #include <sys/wait.h>
 #include <ctype.h>
+#include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/socket.h>
 #include <fcntl.h>
@@ -434,6 +435,7 @@ static pid_t StartChildProcess(AuxProcType type);
 static void StartAutovacuumWorker(void);
 static void MaybeStartWalReceiver(void);
 static void InitPostmasterDeathWatchHandle(void);
+static void InitFsyncFdSocketPair(void);
 
 /*
  * Archiver is allowed to start up at the current postmaster state?
@@ -568,6 +570,8 @@ int			postmaster_alive_fds[2] = {-1, -1};
 HANDLE		PostmasterHandle;
 #endif
 
+int			fsync_fds[2] = {-1, -1};
+
 /*
  * Postmaster main entry point
  */
@@ -1195,6 +1199,11 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	InitPostmasterDeathWatchHandle();
 
+	/*
+	 * Initialize socket pair used to transport file descriptors over.
+	 */
+	InitFsyncFdSocketPair();
+
 #ifdef WIN32
 
 	/*
@@ -6443,3 +6452,32 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+/* Create socket used for requesting fsyncs by checkpointer */
+static void
+InitFsyncFdSocketPair(void)
+{
+	Assert(MyProcPid == PostmasterPid);
+	if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0)
+		ereport(FATAL,
+				(errcode_for_file_access(),
+				 errmsg_internal("could not create fsync sockets: %m")));
+
+	/*
+	 * Set O_NONBLOCK on both fds.
+	 */
+	if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFL, O_NONBLOCK) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set fsync process socket to nonblocking mode: %m")));
+
+	if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFL, O_NONBLOCK) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set fsync submit socket to nonblocking mode: %m")));
+
+	/*
+	 * FIXME: do DuplicateHandle dance for windows - can that work
+	 * trivially?
+	 */
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 555774320b5..ae3a5bf023f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -142,8 +142,8 @@ typedef struct
 	CycleCtr	cycle_ctr;		/* sync cycle of oldest request */
 	/* requests[f] has bit n set if we need to fsync segment n of fork f */
 	Bitmapset  *requests[MAX_FORKNUM + 1];
-	/* canceled[f] is true if we canceled fsyncs for fork "recently" */
-	bool		canceled[MAX_FORKNUM + 1];
+	File	   *syncfds[MAX_FORKNUM + 1];
+	int			syncfd_len[MAX_FORKNUM + 1];
 } PendingOperationEntry;
 
 typedef struct
@@ -152,6 +152,8 @@ typedef struct
 	CycleCtr	cycle_ctr;		/* mdckpt_cycle_ctr when request was made */
 } PendingUnlinkEntry;
 
+static uint32 open_fsync_queue_files = 0;
+static bool mdsync_in_progress = false;
 static HTAB *pendingOpsTable = NULL;
 static List *pendingUnlinks = NIL;
 static MemoryContext pendingOpsCxt; /* context for the above  */
@@ -196,6 +198,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
 			 BlockNumber blkno, bool skipFsync, int behavior);
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 		   MdfdVec *seg);
+static char *mdpath(RelFileNode rnode, ForkNumber forknum, BlockNumber segno);
+static void mdsyncpass(bool include_current);
 
 
 /*
@@ -1049,43 +1053,28 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- *	mdsync() -- Sync previous writes to stable storage.
+ * Do one pass over the fsync request hashtable and perform the necessary
+ * fsyncs. Increments the mdsync cycle counter.
+ *
+ * If include_current is true, perform all fsyncs (this is done if too many
+ * files are open), otherwise only perform the fsyncs belonging to the cycle
+ * valid at call time.
  */
-void
-mdsync(void)
+static void
+mdsyncpass(bool include_current)
 {
-	static bool mdsync_in_progress = false;
-
 	HASH_SEQ_STATUS hstat;
 	PendingOperationEntry *entry;
 	int			absorb_counter;
 
 	/* Statistics on sync times */
-	int			processed = 0;
 	instr_time	sync_start,
 				sync_end,
 				sync_diff;
 	uint64		elapsed;
-	uint64		longest = 0;
-	uint64		total_elapsed = 0;
-
-	/*
-	 * This is only called during checkpoints, and checkpoints should only
-	 * occur in processes that have created a pendingOpsTable.
-	 */
-	if (!pendingOpsTable)
-		elog(ERROR, "cannot sync without a pendingOpsTable");
-
-	/*
-	 * If we are in the checkpointer, the sync had better include all fsync
-	 * requests that were queued by backends up to this point.  The tightest
-	 * race condition that could occur is that a buffer that must be written
-	 * and fsync'd for the checkpoint could have been dumped by a backend just
-	 * before it was visited by BufferSync().  We know the backend will have
-	 * queued an fsync request before clearing the buffer's dirtybit, so we
-	 * are safe as long as we do an Absorb after completing BufferSync().
-	 */
-	AbsorbFsyncRequests();
+	int			processed = CheckpointStats.ckpt_sync_rels;
+	uint64		longest = CheckpointStats.ckpt_longest_sync;
+	uint64		total_elapsed = CheckpointStats.ckpt_agg_sync_time;
 
 	/*
 	 * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
@@ -1133,17 +1122,27 @@ mdsync(void)
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber	forknum;
+		bool has_remaining;
 
 		/*
-		 * If the entry is new then don't process it this time; it might
-		 * contain multiple fsync-request bits, but they are all new.  Note
-		 * "continue" bypasses the hash-remove call at the bottom of the loop.
+		 * If processing fsync requests because of too many file handles,
+		 * close regardless of cycle. Otherwise we might not find anything
+		 * to close, and we want to make room as quickly as possible so more
+		 * requests can be absorbed.
 		 */
-		if (entry->cycle_ctr == GetCheckpointSyncCycle())
-			continue;
+		if (!include_current)
+		{
+			/*
+			 * If the entry is new then don't process it this time; it might
+			 * contain multiple fsync-request bits, but they are all new.  Note
+			 * "continue" bypasses the hash-remove call at the bottom of the loop.
+			 */
+			if (entry->cycle_ctr == GetCheckpointSyncCycle())
+				continue;
 
-		/* Else assert we haven't missed it */
-		Assert((CycleCtr) (entry->cycle_ctr + 1) == GetCheckpointSyncCycle());
+			/* Else assert we haven't missed it */
+			Assert((CycleCtr) (entry->cycle_ctr + 1) == GetCheckpointSyncCycle());
+		}
 
 		/*
 		 * Scan over the forks and segments represented by the entry.
@@ -1158,158 +1157,144 @@ mdsync(void)
 		 */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
-			Bitmapset  *requests = entry->requests[forknum];
 			int			segno;
 
-			entry->requests[forknum] = NULL;
-			entry->canceled[forknum] = false;
-
-			while ((segno = bms_first_member(requests)) >= 0)
+			segno = -1;
+			while ((segno = bms_next_member(entry->requests[forknum], segno)) >= 0)
 			{
-				int			failures;
+				int			returnCode;
+
+				/*
+				 * Temporarily mark as processed. Have to do so before
+				 * absorbing further requests, otherwise we might delete a new
+				 * request belonging to a new cycle.
+				 */
+				bms_del_member(entry->requests[forknum], segno);
+
+				if (entry->syncfd_len[forknum] <= segno ||
+					entry->syncfds[forknum][segno] == -1)
+				{
+					/*
+					 * Optionally open file, if we want to support not
+					 * transporting fds as well.
+					 */
+					elog(FATAL, "file not opened");
+				}
 
 				/*
 				 * If fsync is off then we don't have to bother opening the
 				 * file at all.  (We delay checking until this point so that
 				 * changing fsync on the fly behaves sensibly.)
+				 *
+				 * XXX: Why is that an important goal? Doesn't give any
+				 * interesting guarantees afaict?
 				 */
-				if (!enableFsync)
-					continue;
-
-				/*
-				 * If in checkpointer, we want to absorb pending requests
-				 * every so often to prevent overflow of the fsync request
-				 * queue.  It is unspecified whether newly-added entries will
-				 * be visited by hash_seq_search, but we don't care since we
-				 * don't need to process them anyway.
-				 */
-				if (--absorb_counter <= 0)
+				if (enableFsync)
 				{
-					AbsorbFsyncRequests();
-					absorb_counter = FSYNCS_PER_ABSORB;
-				}
-
-				/*
-				 * The fsync table could contain requests to fsync segments
-				 * that have been deleted (unlinked) by the time we get to
-				 * them. Rather than just hoping an ENOENT (or EACCES on
-				 * Windows) error can be ignored, what we do on error is
-				 * absorb pending requests and then retry.  Since mdunlink()
-				 * queues a "cancel" message before actually unlinking, the
-				 * fsync request is guaranteed to be marked canceled after the
-				 * absorb if it really was this case. DROP DATABASE likewise
-				 * has to tell us to forget fsync requests before it starts
-				 * deletions.
-				 */
-				for (failures = 0;; failures++) /* loop exits at "break" */
-				{
-					SMgrRelation reln;
-					MdfdVec    *seg;
-					char	   *path;
-					int			save_errno;
-
 					/*
-					 * Find or create an smgr hash entry for this relation.
-					 * This may seem a bit unclean -- md calling smgr?	But
-					 * it's really the best solution.  It ensures that the
-					 * open file reference isn't permanently leaked if we get
-					 * an error here. (You may say "but an unreferenced
-					 * SMgrRelation is still a leak!" Not really, because the
-					 * only case in which a checkpoint is done by a process
-					 * that isn't about to shut down is in the checkpointer,
-					 * and it will periodically do smgrcloseall(). This fact
-					 * justifies our not closing the reln in the success path
-					 * either, which is a good thing since in non-checkpointer
-					 * cases we couldn't safely do that.)
+					 * The fsync table could contain requests to fsync
+					 * segments that have been deleted (unlinked) by the time
+					 * we get to them.  That used to be problematic, but now
+					 * we have a filehandle to the deleted file. That means we
+					 * might fsync an empty file superfluously, in a
+					 * relatively tight window, which is acceptable.
 					 */
-					reln = smgropen(entry->rnode, InvalidBackendId);
-
-					/* Attempt to open and fsync the target segment */
-					seg = _mdfd_getseg(reln, forknum,
-									   (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
-									   false,
-									   EXTENSION_RETURN_NULL
-									   | EXTENSION_DONT_CHECK_SIZE);
 
 					INSTR_TIME_SET_CURRENT(sync_start);
 
-					if (seg != NULL &&
-						FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
+					returnCode = FileSync(entry->syncfds[forknum][segno], WAIT_EVENT_DATA_FILE_SYNC);
+
+					if (returnCode < 0)
 					{
-						/* Success; update statistics about sync timing */
-						INSTR_TIME_SET_CURRENT(sync_end);
-						sync_diff = sync_end;
-						INSTR_TIME_SUBTRACT(sync_diff, sync_start);
-						elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
-						if (elapsed > longest)
-							longest = elapsed;
-						total_elapsed += elapsed;
-						processed++;
-						if (log_checkpoints)
-							elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
-								 processed,
-								 FilePathName(seg->mdfd_vfd),
-								 (double) elapsed / 1000);
+						/* XXX: decide on policy */
+						bms_add_member(entry->requests[forknum], segno);
 
-						break;	/* out of retry loop */
-					}
-
-					/* Compute file name for use in message */
-					save_errno = errno;
-					path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
-					errno = save_errno;
-
-					/*
-					 * It is possible that the relation has been dropped or
-					 * truncated since the fsync request was entered.
-					 * Therefore, allow ENOENT, but only if we didn't fail
-					 * already on this file.  This applies both for
-					 * _mdfd_getseg() and for FileSync, since fd.c might have
-					 * closed the file behind our back.
-					 *
-					 * XXX is there any point in allowing more than one retry?
-					 * Don't see one at the moment, but easy to change the
-					 * test here if so.
-					 */
-					if (!FILE_POSSIBLY_DELETED(errno) ||
-						failures > 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\": %m",
-										path)));
-					else
+										FilePathName(entry->syncfds[forknum][segno]))));
+					}
+
+					/* Success; update statistics about sync timing */
+					INSTR_TIME_SET_CURRENT(sync_end);
+					sync_diff = sync_end;
+					INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+					elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+					if (elapsed > longest)
+						longest = elapsed;
+					total_elapsed += elapsed;
+					processed++;
+					if (log_checkpoints)
 						ereport(DEBUG1,
-								(errcode_for_file_access(),
-								 errmsg("could not fsync file \"%s\" but retrying: %m",
-										path)));
-					pfree(path);
+								(errmsg("checkpoint sync: number=%d file=%s time=%.3f msec",
+										processed,
+										FilePathName(entry->syncfds[forknum][segno]),
+										(double) elapsed / 1000),
+								 errhidestmt(true),
+								 errhidecontext(true)));
+				}
 
+				/*
+				 * It shouldn't be possible for a new request to arrive during
+				 * the fsync (on error this will not be reached).
+				 */
+				Assert(!bms_is_member(segno, entry->requests[forknum]));
+
+				/*
+				 * Close file.  XXX: centralize code.
+				 */
+				{
+					open_fsync_queue_files--;
+					FileClose(entry->syncfds[forknum][segno]);
+					entry->syncfds[forknum][segno] = -1;
+				}
+
+				/*
+				 * If in checkpointer, we want to absorb pending requests every so
+				 * often to prevent overflow of the fsync request queue.  It is
+				 * unspecified whether newly-added entries will be visited by
+				 * hash_seq_search, but we don't care since we don't need to process
+				 * them anyway.
+				 */
+				if (absorb_counter-- <= 0)
+				{
 					/*
-					 * Absorb incoming requests and check to see if a cancel
-					 * arrived for this relation fork.
+					 * Don't absorb if too many files are open. This pass will
+					 * soon close some, so check again later.
 					 */
-					AbsorbFsyncRequests();
-					absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
-					if (entry->canceled[forknum])
-						break;
-				}				/* end retry loop */
+					if (open_fsync_queue_files < ((max_safe_fds * 7) / 10))
+						AbsorbFsyncRequests();
+					absorb_counter = FSYNCS_PER_ABSORB;
+				}
 			}
-			bms_free(requests);
 		}
 
 		/*
-		 * We've finished everything that was requested before we started to
-		 * scan the entry.  If no new requests have been inserted meanwhile,
-		 * remove the entry.  Otherwise, update its cycle counter, as all the
-		 * requests now in it must have arrived during this cycle.
+		 * We've finished everything for the file that was requested before we
+		 * started to scan the entry.  If no new requests have been inserted
+		 * meanwhile, remove the entry.  Otherwise, update its cycle counter,
+		 * as all the requests now in it must have arrived during this cycle.
+		 *
+		 * This needs to be checked separately from the above for-each-fork
+		 * loop, as new requests for this relation could have been absorbed.
 		 */
+		has_remaining = false;
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
-			if (entry->requests[forknum] != NULL)
-				break;
+			if (bms_is_empty(entry->requests[forknum]))
+			{
+				if (entry->syncfds[forknum])
+				{
+					pfree(entry->syncfds[forknum]);
+					entry->syncfds[forknum] = NULL;
+				}
+				bms_free(entry->requests[forknum]);
+				entry->requests[forknum] = NULL;
+			}
+			else
+				has_remaining = true;
 		}
-		if (forknum <= MAX_FORKNUM)
+		if (has_remaining)
 			entry->cycle_ctr = GetCheckpointSyncCycle();
 		else
 		{
@@ -1320,13 +1305,69 @@ mdsync(void)
 		}
 	}							/* end loop over hashtable entries */
 
-	/* Return sync performance metrics for report at checkpoint end */
+	/* Flag successful completion of mdsync */
+	mdsync_in_progress = false;
+
+	/* Maintain sync performance metrics for report at checkpoint end */
 	CheckpointStats.ckpt_sync_rels = processed;
 	CheckpointStats.ckpt_longest_sync = longest;
 	CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+}
 
-	/* Flag successful completion of mdsync */
-	mdsync_in_progress = false;
+/*
+ *	mdsync() -- Sync previous writes to stable storage.
+ */
+void
+mdsync(void)
+{
+	/*
+	 * This is only called during checkpoints, and checkpoints should only
+	 * occur in processes that have created a pendingOpsTable.
+	 */
+	if (!pendingOpsTable)
+		elog(ERROR, "cannot sync without a pendingOpsTable");
+
+	/*
+	 * If we are in the checkpointer, the sync had better include all fsync
+	 * requests that were queued by backends up to this point.  The tightest
+	 * race condition that could occur is that a buffer that must be written
+	 * and fsync'd for the checkpoint could have been dumped by a backend just
+	 * before it was visited by BufferSync().  We know the backend will have
+	 * queued an fsync request before clearing the buffer's dirtybit, so we
+	 * are safe as long as we do an Absorb after completing BufferSync().
+	 */
+	AbsorbAllFsyncRequests();
+
+	mdsyncpass(false);
+}
+
+/*
+ * Flush the fsync request queue enough to make sure there's room for at least
+ * one more entry.
+ */
+bool
+FlushFsyncRequestQueueIfNecessary(void)
+{
+	if (mdsync_in_progress)
+		return false;
+
+	while (true)
+	{
+		if (open_fsync_queue_files >= ((max_safe_fds * 7) / 10))
+		{
+			elog(DEBUG1,
+				 "flush fsync request queue due to %u open files",
+				 open_fsync_queue_files);
+			mdsyncpass(true);
+			elog(DEBUG1,
+				 "flushed fsync request, now at %u open files",
+				 open_fsync_queue_files);
+		}
+		else
+			break;
+	}
+
+	return true;
 }
 
 /*
@@ -1411,12 +1452,38 @@ mdpostckpt(void)
 		 */
 		if (--absorb_counter <= 0)
 		{
-			AbsorbFsyncRequests();
+			/* XXX: Centralize this condition */
+			if (open_fsync_queue_files < ((max_safe_fds * 7) / 10))
+				AbsorbFsyncRequests();
 			absorb_counter = UNLINKS_PER_ABSORB;
 		}
 	}
 }
 
+
+/*
+ * Return the filename for the specified segment of the relation. The
+ * returned string is palloc'd.
+ */
+static char *
+mdpath(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+{
+	char	   *path,
+			   *fullpath;
+
+	path = relpathperm(rnode, forknum);
+
+	if (segno > 0)
+	{
+		fullpath = psprintf("%s.%u", path, segno);
+		pfree(path);
+	}
+	else
+		fullpath = path;
+
+	return fullpath;
+}
+
 /*
  * register_dirty_segment() -- Mark a relation segment as needing fsync
  *
@@ -1437,6 +1504,13 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 	pg_memory_barrier();
 	cycle = GetCheckpointSyncCycle();
 
+	/*
+	 * For historical reasons the checkpointer keeps track of the number of
+	 * times backends perform writes themselves.
+	 */
+	if (!AmBackgroundWriterProcess())
+		CountBackendWrite();
+
 	/*
 	 * Don't repeatedly register the same segment as dirty.
 	 *
@@ -1449,27 +1523,23 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (pendingOpsTable)
 	{
-		/* push it into local pending-ops table */
-		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
-		seg->mdfd_dirtied_cycle = cycle;
+		int fd;
+
+		/*
+		 * Push it into local pending-ops table.
+		 *
+		 * Gotta duplicate the fd - we can't have fd.c close it behind our
+		 * back, as that'd lead to losing error reporting guarantees on
+		 * Linux. RememberFsyncRequest() will manage the lifetime.
+		 */
+		ReleaseLruFiles();
+		fd = dup(FileGetRawDesc(seg->mdfd_vfd));
+		if (fd < 0)
+			elog(ERROR, "couldn't dup: %m");
+		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno, fd);
 	}
 	else
-	{
-		if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
-		{
-			seg->mdfd_dirtied_cycle = cycle;
-			return;				/* passed it off successfully */
-		}
-
-		ereport(DEBUG1,
-				(errmsg("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
-	}
+		ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno, seg->mdfd_vfd);
 }
 
 /*
@@ -1491,21 +1561,14 @@ register_unlink(RelFileNodeBackend rnode)
 	{
 		/* push it into local pending-ops table */
 		RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
-							 UNLINK_RELATION_REQUEST);
+							 UNLINK_RELATION_REQUEST,
+							 -1);
 	}
 	else
 	{
-		/*
-		 * Notify the checkpointer about it.  If we fail to queue the request
-		 * message, we have to sleep and try again, because we can't simply
-		 * delete the file now.  Ugly, but hopefully won't happen often.
-		 *
-		 * XXX should we just leave the file orphaned instead?
-		 */
+		/* Notify the checkpointer about it. */
 		Assert(IsUnderPostmaster);
-		while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
-									UNLINK_RELATION_REQUEST))
-			pg_usleep(10000L);	/* 10 msec seems a good number */
+		ForwardFsyncRequest(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST, -1);
 	}
 }
 
@@ -1531,7 +1594,7 @@ register_unlink(RelFileNodeBackend rnode)
  * heavyweight operation anyhow, so we'll live with it.)
  */
 void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno, int fd)
 {
 	Assert(pendingOpsTable);
 
@@ -1549,18 +1612,28 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 			/*
 			 * We can't just delete the entry since mdsync could have an
 			 * active hashtable scan.  Instead we delete the bitmapsets; this
-			 * is safe because of the way mdsync is coded.  We also set the
-			 * "canceled" flags so that mdsync can tell that a cancel arrived
-			 * for the fork(s).
+			 * is safe because of the way mdsync is coded.
 			 */
 			if (forknum == InvalidForkNumber)
 			{
 				/* remove requests for all forks */
 				for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 				{
+					int segno;
+
 					bms_free(entry->requests[forknum]);
 					entry->requests[forknum] = NULL;
-					entry->canceled[forknum] = true;
+
+					for (segno = 0; segno < entry->syncfd_len[forknum]; segno++)
+					{
+						if (entry->syncfds[forknum][segno] != -1)
+						{
+							open_fsync_queue_files--;
+							FileClose(entry->syncfds[forknum][segno]);
+							entry->syncfds[forknum][segno] = -1;
+						}
+					}
+
 				}
 			}
 			else
@@ -1568,7 +1641,16 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 				/* remove requests for single fork */
 				bms_free(entry->requests[forknum]);
 				entry->requests[forknum] = NULL;
-				entry->canceled[forknum] = true;
+
+				for (segno = 0; segno < entry->syncfd_len[forknum]; segno++)
+				{
+					if (entry->syncfds[forknum][segno] != -1)
+					{
+						open_fsync_queue_files--;
+						FileClose(entry->syncfds[forknum][segno]);
+						entry->syncfds[forknum][segno] = -1;
+					}
+				}
 			}
 		}
 	}
@@ -1592,7 +1674,6 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 				{
 					bms_free(entry->requests[forknum]);
 					entry->requests[forknum] = NULL;
-					entry->canceled[forknum] = true;
 				}
 			}
 		}
@@ -1646,7 +1727,8 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		{
 			entry->cycle_ctr = GetCheckpointSyncCycle();
 			MemSet(entry->requests, 0, sizeof(entry->requests));
-			MemSet(entry->canceled, 0, sizeof(entry->canceled));
+			MemSet(entry->syncfds, 0, sizeof(entry->syncfds));
+			MemSet(entry->syncfd_len, 0, sizeof(entry->syncfd_len));
 		}
 
 		/*
@@ -1658,6 +1740,57 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		entry->requests[forknum] = bms_add_member(entry->requests[forknum],
 												  (int) segno);
 
+		if (fd >= 0)
+		{
+			/* make space for entry */
+			if (entry->syncfds[forknum] == NULL)
+			{
+				int i;
+
+				entry->syncfds[forknum] = palloc(sizeof(File) * (segno + 1));
+				entry->syncfd_len[forknum] = segno + 1;
+
+				for (i = 0; i <= segno; i++)
+					entry->syncfds[forknum][i] = -1;
+			}
+			else if (entry->syncfd_len[forknum] <= segno)
+			{
+				int i;
+
+				entry->syncfds[forknum] = repalloc(entry->syncfds[forknum],
+												   sizeof(File) * (segno + 1));
+
+				/* initialize newly created entries */
+				for (i = entry->syncfd_len[forknum]; i <= segno; i++)
+					entry->syncfds[forknum][i] = -1;
+
+				entry->syncfd_len[forknum] = segno + 1;
+			}
+
+			if (entry->syncfds[forknum][segno] == -1)
+			{
+				char *path = mdpath(entry->rnode, forknum, segno);
+				open_fsync_queue_files++;
+				/* caller must have reserved entry */
+				entry->syncfds[forknum][segno] =
+					FileOpenForFd(fd, path);
+				pfree(path);
+			}
+			else
+			{
+				/*
+				 * File is already open. Have to keep the older fd, errors
+				 * might only be reported to it, thus close the one we just
+				 * got.
+				 *
+				 * XXX: check for errors.
+				 */
+				close(fd);
+			}
+
+			FlushFsyncRequestQueueIfNecessary();
+		}
+
 		MemoryContextSwitchTo(oldcxt);
 	}
 }
@@ -1674,22 +1807,12 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC, -1);
 	}
 	else if (IsUnderPostmaster)
 	{
-		/*
-		 * Notify the checkpointer about it.  If we fail to queue the cancel
-		 * message, we have to sleep and try again ... ugly, but hopefully
-		 * won't happen often.
-		 *
-		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
-		 * error would leave the no-longer-used file still present on disk,
-		 * which would be bad, so I'm inclined to assume that the checkpointer
-		 * will always empty the queue soon.
-		 */
-		while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
-			pg_usleep(10000L);	/* 10 msec seems a good number */
+		/* Notify the checkpointer about it. */
+		ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC, -1);
 
 		/*
 		 * Note we don't wait for the checkpointer to actually absorb the
@@ -1713,14 +1836,12 @@ ForgetDatabaseFsyncRequests(Oid dbid)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
+		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC, -1);
 	}
 	else if (IsUnderPostmaster)
 	{
 		/* see notes in ForgetRelationFsyncRequests */
-		while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
-									FORGET_DATABASE_FSYNC))
-			pg_usleep(10000L);	/* 10 msec seems a good number */
+		ForwardFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC, -1);
 	}
 }
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 87a5cfad415..58ba671a907 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
 #define _BGWRITER_H
 
 #include "storage/block.h"
+#include "storage/fd.h"
 #include "storage/relfilenode.h"
 
 
@@ -31,9 +32,10 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					BlockNumber segno);
+extern void ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
+								BlockNumber segno, File file);
 extern void AbsorbFsyncRequests(void);
+extern void AbsorbAllFsyncRequests(void);
 
 extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
@@ -43,4 +45,6 @@ extern uint32 IncCheckpointSyncCycle(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void CountBackendWrite(void);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 1877eef2391..e2ba64e8984 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -44,6 +44,11 @@ extern int	postmaster_alive_fds[2];
 #define POSTMASTER_FD_OWN		1	/* kept open by postmaster only */
 #endif
 
+#define FSYNC_FD_SUBMIT			0
+#define FSYNC_FD_PROCESS		1
+
+extern int	fsync_fds[2];
+
 extern PGDLLIMPORT const char *progname;
 
 extern void PostmasterMain(int argc, char *argv[]) pg_attribute_noreturn();
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 558e4d8518b..798a9652927 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -140,7 +140,8 @@ extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);
 extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					 BlockNumber segno);
+					 BlockNumber segno, int fd);
+extern bool FlushFsyncRequestQueueIfNecessary(void);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
 extern void ForgetDatabaseFsyncRequests(Oid dbid);
 
-- 
2.17.0.rc1.dirty

#45Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#44)
Re: Postgres, fsync, and OSs (specifically linux)

On 22 May 2018 at 03:08, Andres Freund <andres@anarazel.de> wrote:
On 2018-05-19 18:12:52 +1200, Thomas Munro wrote:

On Sat, May 19, 2018 at 4:51 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Next, make check hangs in initdb on both of my pet OSes when md.c
raises an error (fseek fails) and we raise an error while raising an
error and deadlock against ourselves. Backtrace here:
https://paste.debian.net/1025336/

Ah, I see now that something similar is happening on Linux too, so I
guess you already knew this.

I didn't. I cleaned something up and only tested installcheck
after... Singleuser mode was broken.

Attached is a new version.

I've changed my previous attempt at using transient files to using File
type files, but unlinked from the LRU so that they're kept open. Not sure
if that's perfect, but seems cleaner.

Thanks for the patch. Out of curiosity I tried to play with it a bit.
`pgbench -i -s 100` actually hung on my machine, because the
copy process ended up waiting after `pg_uds_send_with_fd`
had

errno == EWOULDBLOCK || errno == EAGAIN

as well as the checkpointer process. Looks like with the default
configuration and `max_wal_size=1GB` more gets written to the
socket than read from it, and the buffer eventually becomes full. I've increased
SO_RCVBUF/SO_SNDBUF and `max_wal_size` independently to
check it, and in both cases the problem disappeared (but I assume
only for this particular scale). Is it something that was already considered?

#46Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#45)
Re: Postgres, fsync, and OSs (specifically linux)

Hi,

On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote:

Thanks for the patch. Out of curiosity I tried to play with it a bit.

Thanks.

`pgbench -i -s 100` actually hung on my machine, because the
copy process ended up waiting after `pg_uds_send_with_fd`
had

Hm, that had worked at some point...

errno == EWOULDBLOCK || errno == EAGAIN

as well as the checkpointer process.

What do you mean by that last sentence?

Looks like with the default
configuration and `max_wal_size=1GB` more gets written to the
socket than read from it, and the buffer eventually becomes full.

That's intended to then wake up the checkpointer immediately, so it can
absorb the requests. So something isn't right yet.

I've increased SO_RCVBUF/SO_SNDBUF and `max_wal_size` independently to
check it, and in both cases the problem disappeared (but I assume only
for this particular scale). Is it something that was already
considered?

It's considered. Tuning up those might help with performance, but
shouldn't be required from a correctness POV. Hm.

Greetings,

Andres Freund

#47Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#46)
1 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

On 2018-05-22 08:57:18 -0700, Andres Freund wrote:

Hi,

On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote:

Thanks for the patch. Out of curiosity I tried to play with it a bit.

Thanks.

`pgbench -i -s 100` actually hung on my machine, because the
copy process ended up waiting after `pg_uds_send_with_fd`
had

Hm, that had worked at some point...

errno == EWOULDBLOCK || errno == EAGAIN

as well as the checkpointer process.

What do you mean by that last sentence?

Looks like with the default
configuration and `max_wal_size=1GB` it writes more than reads to a
socket, and a buffer eventually becomes full.

That's intended to then wake up the checkpointer immediately, so it can
absorb the requests. So something isn't right yet.

Doesn't hang here, but it's way too slow. Reason for that is that I've
wrongly resolved a merge conflict. Attached is a fixup patch - does that
address the issue for you?

Greetings,

Andres Freund

Attachments:

0001-Fix-and-improve-pg_atomic_flag-fallback-implementati.patch (text/x-diff; charset=us-ascii)
From 3b98662adf5e0f82375b50833bc618403614a461 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 6 Apr 2018 16:17:01 -0700
Subject: [PATCH 1/2] Fix and improve pg_atomic_flag fallback implementation.

The atomics fallback implementation for pg_atomic_flag was broken,
returning the inverted value from pg_atomic_test_set_flag().  This was
unnoticed because a) atomic flags were unused until recently b) the
test code wasn't run when the fallback implementation was in
use (because it didn't allow to test for some edge cases).

Fix the bug, and improve the fallback so it has the same behaviour as
the non-fallback implementation in the problematic edge cases. That
breaks ABI compatibility in the back branches when fallbacks are in
use, but given they were broken until now...

Author: Andres Freund
Reported-by: Daniel Gustafsson
Discussion: https://postgr.es/m/FB948276-7B32-4B77-83E6-D00167F8EEB4@yesql.se
Backpatch: 9.5-, where the atomics abstraction was introduced.
---
 src/backend/port/atomics.c          | 21 +++++++++++++++++++--
 src/include/port/atomics/fallback.h | 13 ++-----------
 src/test/regress/regress.c          | 14 --------------
 3 files changed, 21 insertions(+), 27 deletions(-)

diff --git a/src/backend/port/atomics.c b/src/backend/port/atomics.c
index e4e4734dd23..caa84bf2b62 100644
--- a/src/backend/port/atomics.c
+++ b/src/backend/port/atomics.c
@@ -68,18 +68,35 @@ pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
 #else
 	SpinLockInit((slock_t *) &ptr->sema);
 #endif
+
+	ptr->value = false;
 }
 
 bool
 pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
 {
-	return TAS((slock_t *) &ptr->sema);
+	uint32		oldval;
+
+	SpinLockAcquire((slock_t *) &ptr->sema);
+	oldval = ptr->value;
+	ptr->value = true;
+	SpinLockRelease((slock_t *) &ptr->sema);
+
+	return oldval == 0;
 }
 
 void
 pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
 {
-	S_UNLOCK((slock_t *) &ptr->sema);
+	SpinLockAcquire((slock_t *) &ptr->sema);
+	ptr->value = false;
+	SpinLockRelease((slock_t *) &ptr->sema);
+}
+
+bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return ptr->value == 0;
 }
 
 #endif							/* PG_HAVE_ATOMIC_FLAG_SIMULATION */
diff --git a/src/include/port/atomics/fallback.h b/src/include/port/atomics/fallback.h
index 7b9dcad8073..88a967ad5b9 100644
--- a/src/include/port/atomics/fallback.h
+++ b/src/include/port/atomics/fallback.h
@@ -80,6 +80,7 @@ typedef struct pg_atomic_flag
 #else
 	int			sema;
 #endif
+	volatile bool value;
 } pg_atomic_flag;
 
 #endif /* PG_HAVE_ATOMIC_FLAG_SUPPORT */
@@ -132,17 +133,7 @@ extern bool pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr);
 extern void pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr);
 
 #define PG_HAVE_ATOMIC_UNLOCKED_TEST_FLAG
-static inline bool
-pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
-{
-	/*
-	 * Can't do this efficiently in the semaphore based implementation - we'd
-	 * have to try to acquire the semaphore - so always return true. That's
-	 * correct, because this is only an unlocked test anyway. Do this in the
-	 * header so compilers can optimize the test away.
-	 */
-	return true;
-}
+extern bool pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr);
 
 #endif /* PG_HAVE_ATOMIC_FLAG_SIMULATION */
 
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index e14322c798a..8bc562ee4f0 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -633,7 +633,6 @@ wait_pid(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
-#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
 static void
 test_atomic_flag(void)
 {
@@ -663,7 +662,6 @@ test_atomic_flag(void)
 
 	pg_atomic_clear_flag(&flag);
 }
-#endif							/* PG_HAVE_ATOMIC_FLAG_SIMULATION */
 
 static void
 test_atomic_uint32(void)
@@ -846,19 +844,7 @@ PG_FUNCTION_INFO_V1(test_atomic_ops);
 Datum
 test_atomic_ops(PG_FUNCTION_ARGS)
 {
-	/* ---
-	 * Can't run the test under the semaphore emulation, it doesn't handle
-	 * checking two edge cases well:
-	 * - pg_atomic_unlocked_test_flag() always returns true
-	 * - locking a already locked flag blocks
-	 * it seems better to not test the semaphore fallback here, than weaken
-	 * the checks for the other cases. The semaphore code will be the same
-	 * everywhere, whereas the efficient implementations wont.
-	 * ---
-	 */
-#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
 	test_atomic_flag();
-#endif
 
 	test_atomic_uint32();
 
-- 
2.17.0.rc1.dirty

#48Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#47)
Re: Postgres, fsync, and OSs (specifically linux)

On 22 May 2018 at 18:47, Andres Freund <andres@anarazel.de> wrote:
On 2018-05-22 08:57:18 -0700, Andres Freund wrote:

Hi,

On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote:

Thanks for the patch. Out of curiosity I tried to play with it a bit.

Thanks.

`pgbench -i -s 100` actually hung on my machine, because the
copy process ended up waiting after `pg_uds_send_with_fd`
had

Hm, that had worked at some point...

errno == EWOULDBLOCK || errno == EAGAIN

as well as the checkpointer process.

What do you mean by that last sentence?

To investigate what's happening I attached gdb to two processes, the COPY
process from pgbench and checkpointer (since I assumed it may be involved).
Both were waiting in WaitLatchOrSocket right after SendFsyncRequest.

Looks like with the default
configuration and `max_wal_size=1GB` it writes more than reads to a
socket, and a buffer eventually becomes full.

That's intended to then wake up the checkpointer immediately, so it can
absorb the requests. So something isn't right yet.

Doesn't hang here, but it's way too slow.

Yep, in my case it was also getting slower, but eventually hung.

Reason for that is that I've wrongly resolved a merge conflict. Attached is a
fixup patch - does that address the issue for you?

Hm...is it a correct patch? I see the same committed in
8c3debbbf61892dabd8b6f3f8d55e600a7901f2b, so I can't really apply it.

#49Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#48)
1 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

On 2018-05-22 20:54:46 +0200, Dmitry Dolgov wrote:

On 22 May 2018 at 18:47, Andres Freund <andres@anarazel.de> wrote:
On 2018-05-22 08:57:18 -0700, Andres Freund wrote:

Hi,

On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote:

Thanks for the patch. Out of curiosity I tried to play with it a bit.

Thanks.

`pgbench -i -s 100` actually hung on my machine, because the
copy process ended up waiting after `pg_uds_send_with_fd`
had

Hm, that had worked at some point...

errno == EWOULDBLOCK || errno == EAGAIN

as well as the checkpointer process.

What do you mean by that last sentence?

To investigate what's happening I attached gdb to two processes, the COPY
process from pgbench and checkpointer (since I assumed it may be involved).
Both were waiting in WaitLatchOrSocket right after SendFsyncRequest.

Huh? Checkpointer was in SendFsyncRequest()? Coudl you share the
backtrace?

Looks like with the default
configuration and `max_wal_size=1GB` it writes more than reads to a
socket, and a buffer eventually becomes full.

That's intended to then wake up the checkpointer immediately, so it can
absorb the requests. So something isn't right yet.

Doesn't hang here, but it's way too slow.

Yep, in my case it was also getting slower, but eventually hung.

Reason for that is that I've wrongly resolved a merge conflict. Attached is a
fixup patch - does that address the issue for you?

Hm...is it a correct patch? I see the same committed in
8c3debbbf61892dabd8b6f3f8d55e600a7901f2b, so I can't really apply it.

Yea, sorry for that. Too many files in my patch directory... Right one
attached.

Greetings,

Andres Freund

Attachments:

0001-fixup-WIP-Optimize-register_dirty_segment-to-not-rep.patch (text/x-diff; charset=us-ascii)
From 483b98fd21b40e2997a1f164155cae698204ec25 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 22 May 2018 09:38:58 -0700
Subject: [PATCH] fixup! WIP: Optimize register_dirty_segment() to not
 repeatedly queue fsync requests.

Merge failure.
---
 src/backend/storage/smgr/md.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index ae3a5bf023f..942e2dcf788 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1540,6 +1540,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 	}
 	else
 		ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno, seg->mdfd_vfd);
+
+	seg->mdfd_dirtied_cycle = cycle;
 }
 
 /*
-- 
2.17.0.rc1.dirty

#50Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#49)
Re: Postgres, fsync, and OSs (specifically linux)

On 22 May 2018 at 20:59, Andres Freund <andres@anarazel.de> wrote:
On 2018-05-22 20:54:46 +0200, Dmitry Dolgov wrote:

On 22 May 2018 at 18:47, Andres Freund <andres@anarazel.de> wrote:
On 2018-05-22 08:57:18 -0700, Andres Freund wrote:

Hi,

On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote:

Thanks for the patch. Out of curiosity I tried to play with it a bit.

Thanks.

`pgbench -i -s 100` actually hung on my machine, because the
copy process ended up waiting after `pg_uds_send_with_fd`
had

Hm, that had worked at some point...

errno == EWOULDBLOCK || errno == EAGAIN

as well as the checkpointer process.

What do you mean by that last sentence?

To investigate what's happening I attached gdb to two processes, the COPY
process from pgbench and checkpointer (since I assumed it may be involved).
Both were waiting in WaitLatchOrSocket right after SendFsyncRequest.

Huh? Checkpointer was in SendFsyncRequest()? Could you share the
backtrace?

Well, that's what I've got from gdb:

#0 0x00007fae03fae9f3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
#1 0x000000000077a979 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ffe37529ec0, cur_timeout=-1, set=0x23cddf8) at
latch.c:1048
#2 WaitEventSetWait (set=set@entry=0x23cddf8,
timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7ffe37529ec0,
nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=0) at
latch.c:1000
#3 0x000000000077ad08 in WaitLatchOrSocket
(latch=latch@entry=0x0, wakeEvents=wakeEvents@entry=4, sock=8,
timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=0) at
latch.c:385
#4 0x00000000007152cb in SendFsyncRequest
(request=request@entry=0x7ffe37529f40, fd=fd@entry=-1) at
checkpointer.c:1345
#5 0x0000000000716223 in AbsorbAllFsyncRequests () at checkpointer.c:1207
#6 0x000000000079a5f0 in mdsync () at md.c:1339
#7 0x000000000079c672 in smgrsync () at smgr.c:766
#8 0x000000000076dd53 in CheckPointBuffers (flags=flags@entry=64)
at bufmgr.c:2581
#9 0x000000000051c681 in CheckPointGuts
(checkPointRedo=722254352, flags=flags@entry=64) at xlog.c:9079
#10 0x0000000000523c4a in CreateCheckPoint (flags=flags@entry=64)
at xlog.c:8863
#11 0x0000000000715f41 in CheckpointerMain () at checkpointer.c:494
#12 0x00000000005329f4 in AuxiliaryProcessMain (argc=argc@entry=2,
argv=argv@entry=0x7ffe3752a220) at bootstrap.c:451
#13 0x0000000000720c28 in StartChildProcess
(type=type@entry=CheckpointerProcess) at postmaster.c:5340
#14 0x0000000000721c23 in reaper (postgres_signal_arg=<optimized
out>) at postmaster.c:2875
#15 <signal handler called>
#16 0x00007fae03fa45b3 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:84
#17 0x0000000000722968 in ServerLoop () at postmaster.c:1679
#18 0x0000000000723cde in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x23a00e0) at postmaster.c:1388
#19 0x000000000068979f in main (argc=3, argv=0x23a00e0) at main.c:228

Looks like with the default
configuration and `max_wal_size=1GB` it writes more than reads to a
socket, and a buffer eventually becomes full.

That's intended to then wake up the checkpointer immediately, so it can
absorb the requests. So something isn't right yet.

Doesn't hang here, but it's way too slow.

Yep, in my case it was also getting slower, but eventually hung.

Reason for that is that I've wrongly resolved a merge conflict. Attached is a
fixup patch - does that address the issue for you?

Hm...is it a correct patch? I see the same committed in
8c3debbbf61892dabd8b6f3f8d55e600a7901f2b, so I can't really apply it.

Yea, sorry for that. Too many files in my patch directory... Right one
attached.

Yes, this patch solves the problem, thanks.

#51Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#50)
Re: Postgres, fsync, and OSs (specifically linux)

On 2018-05-22 21:58:06 +0200, Dmitry Dolgov wrote:

On 22 May 2018 at 20:59, Andres Freund <andres@anarazel.de> wrote:
On 2018-05-22 20:54:46 +0200, Dmitry Dolgov wrote:
Huh? Checkpointer was in SendFsyncRequest()? Could you share the
backtrace?

Well, that's what I've got from gdb:

#3 0x000000000077ad08 in WaitLatchOrSocket
(latch=latch@entry=0x0, wakeEvents=wakeEvents@entry=4, sock=8,
timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=0) at
latch.c:385
#4 0x00000000007152cb in SendFsyncRequest
(request=request@entry=0x7ffe37529f40, fd=fd@entry=-1) at
checkpointer.c:1345
#5 0x0000000000716223 in AbsorbAllFsyncRequests () at checkpointer.c:1207

Oh, I see. That makes sense. So it's possible to self-deadlock
here. Should be easy to fix... Thanks for finding that one.

Yes, this patch solves the problem, thanks.

Just avoids it, I'm afraid... It probably can't be hit easily, but the
issue is there...

- Andres

#52Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#43)
1 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

On 21 May 2018 at 15:50, Craig Ringer <craig@2ndquadrant.com> wrote:

On 21 May 2018 at 12:57, Craig Ringer <craig@2ndquadrant.com> wrote:

On 18 May 2018 at 00:44, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:

while ((src = (RewriteMappingFile *)

hash_seq_search(&seq_status)) != NULL)

{
if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC)

!= 0)

-                     ereport(ERROR,
+                     ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync file

\"%s\": %m", src->path)));

To me this (and the other callers) doesn't quite look right. First, I
think we should probably be a bit more restrictive about when PANIC
out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
others. Secondly, I think we should centralize the error handling. It
seems likely that we'll acrue some platform specific workarounds, and I
don't want to copy that knowledge everywhere.

Also, don't we need the same on close()?

Yes, we do, and that expands the scope a bit.

I agree with Robert that some sort of filter/macro is wise, though naming
it clearly will be tricky.

I'll have a look.

On the queue for tomorrow.

Hi all.

I've revised the fsync patch with the cleanups discussed and gone through
the close() calls.

AFAICS either socket closes, temp file closes, or (for WAL) already PANIC
on close. It's mainly fd.c that needs amendment. Which I've done per the
attached revised patch.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v3-0001-PANIC-when-we-detect-a-possible-fsync-I-O-error-i.patch (text/x-patch; charset=US-ASCII)
From 952e6fbab1a735f677d8e6e92ba0aa8d53d9ab3e Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 10 Apr 2018 14:08:32 +0800
Subject: [PATCH v3] PANIC when we detect a possible fsync I/O error instead of
 retrying fsync

Panic the server on fsync failure in places where we can't simply repeat the
whole operation on retry. Most importantly, panic when fsync fails during a
checkpoint.

This will result in log messages like:

    PANIC:  58030: could not fsync file "base/12367/16386": Input/output error
    LOG:  00000: checkpointer process (PID 10799) was terminated by signal 6: Aborted

and, if the condition persists during redo:

    LOG:  00000: checkpoint starting: end-of-recovery immediate
    PANIC:  58030: could not fsync file "base/12367/16386": Input/output error
    LOG:  00000: startup process (PID 10808) was terminated by signal 6: Aborted

Why?

In a number of places PostgreSQL we responded to fsync() errors by retrying the
fsync(), expecting that this would force the operating system to repeat any
write attempts. The code assumed that fsync() would return an error on all
subsequent calls until any I/O error was resolved.

This is not what actually happens on some platforms, including Linux. The
operating system may give up and drop dirty buffers for async writes on the
floor and mark the page mapping as bad. The first fsync() clears any error flag
from the page entry and/or our file descriptor. So a subsequent fsync() returns
success, even though the data PostgreSQL wrote was really discarded.

We have no way to find out which writes failed, and no way to ask the kernel to
retry indefinitely, so all we can do is PANIC. Redo will attempt the write
again, and if it fails again, it will also PANIC.

This doesn't completely prevent fsync reliability issues, because it only
handles cases where the kernel actually reports the error to us. It's entirely
possible for a buffered write to be lost without causing fsync to report an
error at all (see discussion below). Work on addressing those issues and
documenting them is ongoing and will be committed separately.

Because NFS on Linux performs an implicit fsync() on close(), we also PANIC on
close() failures for non-temporary files managed by fd.c.

See:

* https://www.postgresql.org/message-id/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
* https://www.postgresql.org/message-id/20180427222842.in2e4mibx45zdth5@alap3.anarazel.de
* https://lwn.net/Articles/752063/
* https://lwn.net/Articles/753650/
* https://lwn.net/Articles/752952/
* https://lwn.net/Articles/752613/
---
 src/backend/access/heap/rewriteheap.c       |  6 ++--
 src/backend/access/transam/timeline.c       |  4 +--
 src/backend/access/transam/twophase.c       |  2 +-
 src/backend/access/transam/xlog.c           |  9 +++++-
 src/backend/replication/logical/snapbuild.c |  3 ++
 src/backend/storage/file/fd.c               | 50 +++++++++++++++++++++++++----
 src/backend/storage/smgr/md.c               | 22 +++++++++----
 src/backend/utils/cache/relmapper.c         |  2 +-
 src/backend/utils/error/elog.c              | 11 +++++++
 src/include/utils/elog.h                    |  3 +-
 10 files changed, 91 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 8d3c861a33..a86a2f3824 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -965,7 +965,7 @@ logical_end_heap_rewrite(RewriteState state)
 	while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
 	{
 		if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
-			ereport(ERROR,
+			ereport(promote_ioerr_to_panic(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m", src->path)));
 		FileClose(src->vfd);
@@ -1180,7 +1180,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(promote_ioerr_to_panic(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", path)));
 	pgstat_report_wait_end();
@@ -1279,7 +1279,7 @@ CheckPointLogicalRewriteHeap(void)
 			 */
 			pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC);
 			if (pg_fsync(fd) != 0)
-				ereport(ERROR,
+				ereport(promote_ioerr_to_panic(ERROR),
 						(errcode_for_file_access(),
 						 errmsg("could not fsync file \"%s\": %m", path)));
 			pgstat_report_wait_end();
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 61d36050c3..65e5ff6a82 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -406,7 +406,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 
 	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(promote_ioerr_to_panic(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
@@ -485,7 +485,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 
 	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(promote_ioerr_to_panic(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 65194db70e..350238e2fb 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1687,7 +1687,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	if (pg_fsync(fd) != 0)
 	{
 		CloseTransientFile(fd);
-		ereport(ERROR,
+		ereport(promote_ioerr_to_panic(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync two-phase state file: %m")));
 	}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adbd6a2126..c6c6b64826 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3268,7 +3268,14 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
+		int save_errno = errno;
+		unlink(tmppath);
 		close(fd);
+		errno = save_errno;
+		/*
+		 * This fsync failure is not PANIC-worthy because it's still a temp
+		 * file at this time.
+		 */
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
@@ -3435,7 +3442,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(promote_ioerr_to_panic(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4123cdebcf..31ab7c1de9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1616,6 +1616,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	 * fsync the file before renaming so that even if we crash after this we
 	 * have either a fully valid file or nothing.
 	 *
+	 * It's safe to just ERROR on fsync() here because we'll retry the whole
+	 * operation including the writes.
+	 *
 	 * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
 	 * some noticeable overhead since it's performed synchronously during
 	 * decoding?
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 441f18dcf5..9a29eaef7b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -353,6 +353,17 @@ pg_fsync(int fd)
 /*
  * pg_fsync_no_writethrough --- same as fsync except does nothing if
  *	enableFsync is off
+ *
+ * WARNING: It is unsafe to retry fsync() calls without repeating the preceding
+ * writes.  fsync() clears the error flag on some platforms (including Linux,
+ * true up to at least 4.14) when it reports the error to the caller. A second
+ * call may return success even though writes are lost. Many callers test the
+ * return value and PANIC on failure so that redo repeats the writes. It is
+ * safe to ERROR instead if the whole operation can be retried without needing
+ * WAL redo.
+ *
+ * See https://lwn.net/Articles/752063/
+ * and https://www.postgresql.org/message-id/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
  */
 int
 pg_fsync_no_writethrough(int fd)
@@ -443,7 +454,12 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 		rc = sync_file_range(fd, offset, nbytes,
 							 SYNC_FILE_RANGE_WRITE);
 
-		/* don't error out, this is just a performance optimization */
+		/*
+		 * Don't error out, this is just a performance optimization.
+		 *
+		 * sync_file_range(SYNC_FILE_RANGE_WRITE) won't clear any error flags,
+		 * so we don't have to worry about this impacting fsync reliability.
+		 */
 		if (rc != 0)
 		{
 			ereport(WARNING,
@@ -518,7 +534,12 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 			rc = msync(p, (size_t) nbytes, MS_ASYNC);
 			if (rc != 0)
 			{
-				ereport(WARNING,
+				/*
+				 * We must panic here to preserve fsync reliability,
+				 * as msync may clear the fsync error state on some
+				 * OSes. See pg_fsync_no_writethrough().
+				 */
+				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not flush dirty data: %m")));
 				/* NB: need to fall through to munmap()! */
@@ -1046,11 +1067,17 @@ LruDelete(File file)
 	}
 
 	/*
-	 * Close the file.  We aren't expecting this to fail; if it does, better
-	 * to leak the FD than to mess up our internal state.
+	 * Close the file.  We aren't expecting this to fail; if it does, we need
+	 * to PANIC on I/O errors for non-temporary files in case it's an an
+	 * important file. That's because NFS on Linux may do an implicit fsync()
+	 * on close() which can cause similar issues to those discussed in the
+	 * comments on pg_fsync.
+	 *
+	 * Otherwise, better to leak the FD than to mess up our internal state.
 	 */
 	if (close(vfdP->fd))
-		elog(LOG, "could not close file \"%s\": %m", vfdP->fileName);
+		elog(vfdP->fdstate & FD_DELETE_AT_CLOSE ? LOG : promote_ioerr_to_panic(LOG),
+			 "could not close file \"%s\": %m", vfdP->fileName);
 	vfdP->fd = VFD_CLOSED;
 	--nfile;
 
@@ -1754,7 +1781,14 @@ FileClose(File file)
 	{
 		/* close the file */
 		if (close(vfdP->fd))
-			elog(LOG, "could not close file \"%s\": %m", vfdP->fileName);
+		{
+			/*
+			 * We must panic on failure to close non-temporary files; see
+			 * LruDelete.
+			 */
+			elog(vfdP->fdstate & FD_DELETE_AT_CLOSE ? LOG : promote_ioerr_to_panic(LOG),
+				"could not close file \"%s\": %m", vfdP->fileName);
+		}
 
 		--nfile;
 		vfdP->fd = VFD_CLOSED;
@@ -3250,6 +3284,10 @@ looks_like_temp_rel_name(const char *name)
  * harmless cases such as read-only files in the data directory, and that's
  * not good either.
  *
+ * Importantly, on Linux (true in 4.14) and some other platforms, fsync errors
+ * will consume the error, causing a subsequent fsync to succeed even though
+ * the writes did not succeed. See pg_fsync_no_writethrough().
+ *
  * Note we assume we're chdir'd into PGDATA to begin with.
  */
 void
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ec103e604..c604ec4e28 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1038,7 +1038,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
 
 		if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
-			ereport(ERROR,
+			ereport(promote_ioerr_to_panic(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(v->mdfd_vfd))));
@@ -1265,13 +1265,22 @@ mdsync(void)
 					 * _mdfd_getseg() and for FileSync, since fd.c might have
 					 * closed the file behind our back.
 					 *
-					 * XXX is there any point in allowing more than one retry?
-					 * Don't see one at the moment, but easy to change the
-					 * test here if so.
+					 * It's unsafe to ignore failures for other errors,
+					 * particularly EIO or (undocumented, but possible) ENOSPC.
+					 * The first fsync() will clear any error flag on dirty
+					 * buffers pending writeback and/or the file descriptor, so
+					 * a second fsync report success despite the buffers
+					 * possibly not being written. (Verified on Linux 4.14).
+					 * To cope with this we must PANIC and redo all writes
+					 * since the last successful checkpoint. See discussion at:
+					 *
+					 * https://www.postgresql.org/message-id/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
+					 *
+					 * for details.
 					 */
 					if (!FILE_POSSIBLY_DELETED(errno) ||
 						failures > 0)
-						ereport(ERROR,
+						ereport(promote_ioerr_to_panic(ERROR),
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\": %m",
 										path)));
@@ -1280,6 +1289,7 @@ mdsync(void)
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\" but retrying: %m",
 										path)));
+
 					pfree(path);
 
 					/*
@@ -1444,7 +1454,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 				(errmsg("could not forward fsync request because request queue is full")));
 
 		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(ERROR,
+			ereport(promote_ioerr_to_panic(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(seg->mdfd_vfd))));
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 99d095f2df..662812b799 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -795,7 +795,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(promote_ioerr_to_panic(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync relation mapping file \"%s\": %m",
 						mapfilename)));
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 16531f7a0f..e202094642 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -692,6 +692,17 @@ errcode_for_socket_access(void)
 	return 0;					/* return value does not matter */
 }
 
+/*
+ * PostgreSQL needs to PANIC on EIO in some cases to preserve data integrity.
+ * See explanation on pg_fsync for details. This keeps that logic in one place.
+ */
+int
+promote_ioerr_to_panic(int elevel)
+{
+	ErrorData  *edata = &errordata[errordata_stack_depth];
+	return (edata->saved_errno == EIO || edata->saved_errno == ENOSPC) ? elevel : PANIC;
+}
+
 
 /*
  * This macro handles expansion of a format string and associated parameters;
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index 7a9ba7f2ff..abd078075c 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -133,6 +133,8 @@ extern int	errcode(int sqlerrcode);
 extern int	errcode_for_file_access(void);
 extern int	errcode_for_socket_access(void);
 
+extern int  promote_ioerr_to_panic(int elevel);
+
 extern int	errmsg(const char *fmt,...) pg_attribute_printf(1, 2);
 extern int	errmsg_internal(const char *fmt,...) pg_attribute_printf(1, 2);
 
@@ -182,7 +184,6 @@ extern int	geterrcode(void);
 extern int	geterrposition(void);
 extern int	getinternalerrposition(void);
 
-
 /*----------
  * Old-style error reporting API: to be used in this way:
  *		elog(ERROR, "portal \"%s\" not found", stmt->portalname);
-- 
2.14.3

#53Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#51)
Re: Postgres, fsync, and OSs (specifically linux)

On Wed, May 23, 2018 at 8:02 AM, Andres Freund <andres@anarazel.de> wrote:

[patches]

Hi Andres,

Obviously there is more work to be done here but the basic idea in
your clone-fd-checkpointer branch as of today seems OK to me. I think
Craig and I both had similar ideas (sending file descriptors that have
an old enough f_wb_err) but we thought rlimit management was going to
be hard (shared memory counters + deduplication, bleugh). You made it
look easy. Nice work.

First, let me describe in my own words what's going on, mostly to make
sure I really understand this:

1. We adopt a new fd lifecycle that is carefully tailored to avoid
error loss on Linux, and can't hurt on other OSes. By sending the
file descriptor used for every write to the checkpointer, we guarantee
that (1) the inode stays pinned (the file is "in flight" in the
socket, even if the sender closes it before the checkpointer receives
it) so Linux won't be tempted to throw away the precious information
in i_mapping->wb_err, and (2) the checkpointer finishes up with a file
descriptor that points to the very same "struct file" with the
f_wb_err value that was originally sampled before the write, by the
sender. So we can't miss any write-back errors. Wahey! However,
errors are still reported only once, so we probably need to panic.

Hmm... Is there any way that the *sender* could finish up in
file_check_and_advance_wb_err() for the same struct file, before the
checkpointer manages to call fsync() on its dup'd fd? I don't
immediately see how (it looks like you have to call one of the various
sync functions to reach that, and your patch removes the fallback
just-call-FileSync()-myself code from register_dirty_segment()). I
guess it would be bad if, say, close() were to do that in some future
Linux release because then we'd have no race-free way to tell the
checkpointer that the file is borked before it runs fsync() and
potentially gets an OK response and reports a successful checkpoint
(even if we panicked, with sufficiently bad timing it might manage to
report a successful checkpoint).

2. In order to make sure that we don't exceed our soft limit on the
number of file descriptors per process, you use a simple backend-local
counter in the checkpointer, on the theory that we don't care about
fds (or, technically, the files they point to) waiting in the socket
buffer, we care only about how many the checkpointer has actually
received but not yet closed. As long as we make sure there is space
for at least one more before we read one message, everything should be
fine. Good and simple.

One reason I thought this would be harder is because I had no model of
how RLIMIT_NOFILE would apply when you start flinging fds around
between processes (considering there can be moments when neither end
has the file open), so I feared the worst and thought we would need to
maintain a counter in shared memory and have senders and receiver
cooperate to respect it. My intuition that this was murky and
required pessimism wasn't too far from the truth actually: apparently
the kernel didn't do a great job at accounting for that until a
patch[1] landed for CVE-2013-4312.

The behaviour in older releases is super lax, so no problem there.
The behaviour from 4.9 onward (or is it 4.4.1?) is that you get a
separate per-user RLIMIT_NOFILE allowance for in-flight fds. So
effectively the sender doesn't have to worry about about fds it has
sent but closed and the receiver doesn't have to care about fds it
hasn't received yet, so your counting scheme seems OK. As for
exceeding RLIMIT_NOFILE with in-flight fds, it's at least bounded by
the fact that the socket would block/EWOULDBLOCK if the receiver isn't
draining it fast enough and can only hold a small and finite amount of
data and thus file descriptors, so we can probably just ignore that.
If you did manage to exceed it, I think you'd find out about that with
ETOOMANYREFS at send time (curiously absent from the sendmsg() man
page, but there in black and white in the source for
unix_attach_fds()), and then you'd just raise an error (currently
FATAL in your patch). I have no idea how the rlimit for SCM-ified
files works on other Unixoid systems though.

Some actual code feedback:

+                       if (entry->syncfds[forknum][segno] == -1)
+                       {
+                               char *path = mdpath(entry->rnode, forknum, segno);
+                               open_fsync_queue_files++;
+                               /* caller must have reserved entry */
+                               entry->syncfds[forknum][segno] =
+                                       FileOpenForFd(fd, path);
+                               pfree(path);
+                       }
+                       else
+                       {
+                               /*
+                                * File is already open. Have to keep the
+                                * older fd, errors might only be reported
+                                * to it, thus close the one we just got.
+                                *
+                                * XXX: check for errors.
+                                */
+                               close(fd);
+                       }

Wait... it's not guaranteed to be older in open() time, is it? It's
just older in sendmsg() time. Those might be different:

A: open("base/42/1234")
A: write()
kernel: inode->i_mapping->wb_err++
B: open("base/42/1234")
B: write()
B: sendmsg()
A: sendmsg()
C: recvmsg() /* clone B's fd */
C: recvmsg() /* clone A's fd, throw it away because we already have B's */
C: fsync()

I think that would eat an error under the 4.13 code. I think it might
be OK under the new 4.17 code, because the new "must report it to at
least one caller" thing would save the day. So now I'm wondering if
that's getting backported to the 4.14 long term kernel.

Aha, yes it has been already[2].

So... if you don't have to worry about kernels without that patch, I
suspect the exact ordering doesn't matter anymore, as long as
*someone* held the file open at all times between write and fsync() to
keep the inode around, which this patch achieves. (And of course no
other process sets the ERRSEQ_SEEN flag, for example some other kind
of backend or random other program that opened our data file and
called one of the sync syscalls, or any future syscalls that start
calling file_check_and_advance_wb_err()).
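To make the "save the day" behaviour concrete, here is a toy model of the errseq mechanism. This is our reading of lib/errseq.c, deliberately simplified (no single-word bit-packing, invented field names), not kernel code, so treat it as a mental model only:

```c
/* Toy errseq: an error value, a change counter, and a "seen" flag.
 * The real errseq_t packs all three into one 32-bit word. */
struct wb_err
{
	int			err;	/* last writeback error, 0 if none */
	unsigned	ctr;	/* bumped when a seen error is overwritten */
	int			seen;	/* has anyone been told about err yet? */
};

static unsigned
wb_encode(const struct wb_err *e)
{
	return (e->ctr << 8) | (e->err & 0xff);
}

/* Writeback fails: record the error; it is now unseen. */
static void
wb_set(struct wb_err *e, int err)
{
	if (e->seen)
		e->ctr++;
	e->err = err;
	e->seen = 0;
}

/*
 * Sampling at open(): since the 4.17-era change (backported to 4.14),
 * an as-yet-unseen error samples as 0, so a later check against that
 * sample must still report it.  The older behaviour returned the
 * current encoded value unconditionally, which is how the A/B/C
 * interleaving above could eat the error.
 */
static unsigned
wb_sample(const struct wb_err *e)
{
	return e->seen ? wb_encode(e) : 0;
}

/* fsync(): report anything that changed since our sample. */
static int
wb_check_and_advance(struct wb_err *e, unsigned *since)
{
	unsigned	cur = wb_encode(e);

	if (*since == cur)
		return 0;
	*since = cur;
	e->seen = 1;
	return e->err;
}

/* The A/B/C scenario: B opens after A's error; only B's fd survives. */
static int
demo_errseq(void)
{
	struct wb_err e = {0, 0, 0};
	unsigned	since_b;

	wb_set(&e, 5);				/* A's write fails in writeback */
	since_b = wb_sample(&e);	/* B opens the file after the error */
	/* The checkpointer kept only B's fd; its fsync still reports: */
	return wb_check_and_advance(&e, &since_b);
}
```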

+       /*
+        * FIXME: do DuplicateHandle dance for windows - can that work
+        * trivially?
+        */

I don't know, I guess something like CreatePipe() and then
write_duplicate_handle()? And some magic (maybe our own LWLock) to
allow atomic messages?

A more interesting question is: how will you cap the number of file
handles you send through that pipe? On that OS you call
DuplicateHandle() to fling handles into another process's handle table
directly. Then you send the handle number as plain old data to the
other process via carrier pigeon, smoke signals, a pipe etc. That's
interesting because the handle allocation is asynchronous from the
point of view of the receiver. Unlike the Unix case where the
receiver can count handles and make sure there is space for one more
before it reads a potentially-SCM-containing message, here the
*senders* will somehow need to make sure they don't create too many in
the receiving process. I guess that would involve a shared counter,
and a strategy for what to do when the number is too high (probably
just wait).

Hmm. I wonder if that would be a safer approach on all operating systems.

+       if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0)
...
+       size = sendmsg(sock, &msg, 0);

Here you are relying on atomic sending and receiving of whole messages
over a stream socket. For example, in Linux's unix_stream_sendmsg()
it's going to be chopped into buffers of size (sk->sk_sndbuf >> 1) -
64 (I have no clue how big that is) that are appended one at a time to
the receiving socket's queue, with no locking in between, so
concurrently sent messages can be interleaved with yours. How big is
big enough for that to happen? There doesn't seem to be an equivalent
of PIPE_BUF for Unix domain sockets. These tiny messages are almost
certainly safe, but I wonder if we should be using SOCK_SEQPACKET
instead of SOCK_STREAM?

Might be a less theoretical problem if we switch to variable sized
messages containing file paths as you once contemplated in an off-list
chat.
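The property SOCK_SEQPACKET would buy us is kernel-enforced record boundaries. A minimal sketch (ours, not from the patch; AF_UNIX SOCK_SEQPACKET is available on Linux and FreeBSD but not everywhere):

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Two differently sized records sent down one SOCK_SEQPACKET pair:
 * each recv() returns exactly one record, never a concatenation,
 * even though the receive buffer is big enough for both. */
static int
demo_seqpacket(void)
{
	int			sv[2];
	char		small[4] = "abc";
	char		big[16] = "0123456789abcde";
	char		buf[64];

	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
		return -1;
	if (send(sv[0], small, sizeof(small), 0) != sizeof(small))
		return -1;
	if (send(sv[0], big, sizeof(big), 0) != sizeof(big))
		return -1;
	if (recv(sv[1], buf, sizeof(buf), 0) != sizeof(small))
		return -1;
	if (recv(sv[1], buf, sizeof(buf), 0) != sizeof(big))
		return -1;
	close(sv[0]);
	close(sv[1]);
	return 0;
}
```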

Presumably for EXEC_BACKEND we'll need to open
PGDATA/something/something/socket or similar.

+                                       if (returnCode < 0)
+                                       {
+                                               /* XXX: decide on policy */
+
bms_add_member(entry->requests[forknum], segno);

Obviously this is pseudocode (doesn't even keep the return value), but
just BTW, I think that if we decide not to PANIC unconditionally on
any kind of fsync() failure, we definitely can't use bms_add_member()
here (it might fail to allocate, and then we forget the segment, raise
and error and won't try again). It's got to be PANIC or no-fail code
(like the patch I proposed in another thread).

+SendFsyncRequest(CheckpointerRequest *request, int fd)
...
+                        * Don't think short reads will ever happen in realistic
...
+                       ereport(FATAL, (errmsg("could not receive fsync request: %m")));

Short *writes*, could not *send*.

+ * back, as that'd lead to loosing error reporting guarantees on

s/loosing/losing/

[1]: https://github.com/torvalds/linux/commit/712f4aad406bb1ed67f3f98d04c044191f0ff593
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/lib/errseq.c?h=linux-4.14.y&id=0799a0ea96e4923f52f85fe315b62e9176a3319c

--
Thomas Munro
http://www.enterprisedb.com

#54Robert Haas
robertmhaas@gmail.com
In reply to: Craig Ringer (#52)
Re: Postgres, fsync, and OSs (specifically linux)

On Tue, May 29, 2018 at 4:53 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

I've revised the fsync patch with the cleanups discussed and gone through
the close() calls.

AFAICS either socket closes, temp file closes, or (for WAL) already PANIC on
close. It's mainly fd.c that needs amendment. Which I've done per the
attached revised patch.

I think we should have a separate thread for this patch vs. Andres's
patch to do magic things with the checkpointer and file-descriptor
forwarding. Meanwhile, here's some review.

1. No longer applies cleanly.

2. I don't like promote_ioerr_to_panic() very much, partly because the
same pattern gets repeated over and over, and partly because it would
be awkwardly-named if we discovered that another 2 or 3 errors needed
similar handling (or some other variant handling). I suggest instead
having a function like report_critical_fsync_failure(char *path) that
does something like this:

    int elevel = ERROR;

    if (errno == EIO)
        elevel = PANIC;
    ereport(elevel,
            (errcode_for_file_access(),
             errmsg("could not fsync file \"%s\": %m", path)));

And similarly I'd add report_critical_close_failure. In some cases,
this would remove wording variations (e.g. in twophase.c) but I think
that's fine, and maybe an improvement, as discussed on another recent
thread.

3. slru.c calls pg_fsync() but isn't changed by the patch. That looks wrong.

4. The comment changes in snapbuild.c interact with the TODO that
immediately follows. I think more adjustment is needed here.

5. It seems odd that you adjusted the comment for
pg_fsync_no_writethrough() but not pg_fsync_writethrough() or
pg_fsync(). Either pg_fsync_writethrough() doesn't have the same
problem, in which case, awesome, but let's add a comment, or it does,
in which case it should refer to the other one. And I think
pg_fsync() itself needs a comment saying that every caller must be
careful to use promote_ioerr_to_panic() or
report_critical_fsync_failure() or whatever we end up calling it
unless the fsync is not critical for data integrity.

6. In md.c, there's a stray blank line added. But more importantly,
the code just above that looks like this:

                     if (!FILE_POSSIBLY_DELETED(errno) ||
                         failures > 0)
-                        ereport(ERROR,
+                        ereport(promote_ioerr_to_panic(ERROR),
                                 (errcode_for_file_access(),
                                  errmsg("could not fsync file \"%s\": %m",
                                         path)));
                     else
                         ereport(DEBUG1,
                                 (errcode_for_file_access(),
                                  errmsg("could not fsync file \"%s\"
but retrying: %m",
                                         path)));

I might be all wet here, but it seems like if we enter the bottom
branch, we still need the promote-to-panic behavior.

7. The comment adjustment for SyncDataDirectory mentions an
"important" fact about fsync behavior, but then doesn't seem to change
any logic on that basis. I think in general a number of these
comments need a little more thought, but in this particular case, I
think we also need to consider what the behavior should be (and the
comment should reflect our considered judgement on that point, and the
implementation should match).

8. Andres suggested to me off-list that we should have a GUC to
disable the promote-to-panic behavior in case it turns out to be a
show-stopper for some user. I think that's probably a good idea.
Adding many new ways to PANIC in a minor release without providing any
way to go back to the old behavior sounds unfriendly. Surely, anyone
who suffers much from this has really serious other problems anyway,
but all the same I think we should provide an escape hatch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#55Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#54)
Re: Postgres, fsync, and OSs (specifically linux)

On Thu, Jul 19, 2018 at 7:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

2. I don't like promote_ioerr_to_panic() very much, partly because the
same pattern gets repeated over and over, and partly because it would
be awkwardly-named if we discovered that another 2 or 3 errors needed
similar handling (or some other variant handling). I suggest instead
having a function like report_critical_fsync_failure(char *path) that
...

Note that if we don't cover *all* errno values, or ...

8. Andres suggested to me off-list that we should have a GUC to
disable the promote-to-panic behavior in case it turns out to be a
show-stopper for some user.

... we let the user turn this off, then we also have to fix this:

/messages/by-id/87y3i1ia4w.fsf@news-spur.riddles.org.uk

--
Thomas Munro
http://www.enterprisedb.com

#56Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#53)
Re: Postgres, fsync, and OSs (specifically linux)

On Thu, Jun 14, 2018 at 5:30 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Wed, May 23, 2018 at 8:02 AM, Andres Freund <andres@anarazel.de> wrote:

[patches]

A more interesting question is: how will you cap the number of file
handles you send through that pipe? On that OS you call
DuplicateHandle() to fling handles into another process's handle table
directly. Then you send the handle number as plain old data to the
other process via carrier pigeon, smoke signals, a pipe etc. That's
interesting because the handle allocation is asynchronous from the
point of view of the receiver. Unlike the Unix case where the
receiver can count handles and make sure there is space for one more
before it reads a potentially-SCM-containing message, here the
*senders* will somehow need to make sure they don't create too many in
the receiving process. I guess that would involve a shared counter,
and a strategy for what to do when the number is too high (probably
just wait).

Hmm. I wonder if that would be a safer approach on all operating systems.

As a way of poking this thread, here are some more thoughts. Buffer
stealing currently looks something like this:

Evicting backend:
lseek(fd)
write(fd)
...enqueue-fsync-request via shm...

Checkpointer:
...push into hash table...

With the patch it presumably looks something like this:

Evicting backend:
lseek(fd)
write(fd)
sendmsg(fsync_socket) /* passes fd */

Checkpointer:
recvmsg(fsync_socket) /* gets a copy of fd */
...push into hash table...
close(fd) /* for all but the first one received for the same file */

That takes us from 2 syscalls to 5 per evicted buffer. I suppose it's
possible that on some operating systems that might hurt a bit, given
that it's happening at the granularity of 1GB data files that could
have a lot of backends working in them concurrently. I have no idea
if it's really a problem on any particular OS. Admittedly on Linux
it's probably just a bunch of fast atomic ops and RCU stuff...
probably only the existing write() actually takes the inode lock or
anything that heavy, and that's probably lost in the noise in an
evict-heavy workload. I don't know, I guess it's probably not a
problem, but I thought I'd mention that.

Contention on the new fsync socket doesn't seem to be a new problem
per se since it replaces a contention point we already had:
CheckpointerCommLock. If that was acceptable today then perhaps that
indicates that any in-kernel contention created by the new syscalls is
also OK.

My feeling so far is that I'd probably go for the sender-collapses model
(and it might even be necessary on Windows?) if doing this as a new
feature, but I fully understand your desire to do it in a much simpler
way that could be back-patched more easily. I'm just slightly
concerned about the unintended consequence risk that comes with
exercising an operating system feature that not all operating system
authors probably intended to be used at high frequency. Nothing that
can't be assuaged by testing.

  * the queue is full and contains no duplicate entries.  In that case, we
  * let the backend know by returning false.
  */
-bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+void
+ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno,
+                                       File file)

Comment out of date.

--
Thomas Munro
http://www.enterprisedb.com

#57Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#56)
4 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

On Sun, Jul 29, 2018 at 6:14 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

As a way of poking this thread, here are some more thoughts.

I am keen to move this forward, not only because it is something we
need to get fixed, but also because I have some other pending patches
in this area and I want this sorted out first.

Here are some small fix-up patches for Andres's patchset:

1. Use FD_CLOEXEC instead of the non-portable Linuxism SOCK_CLOEXEC.

2. Fix the self-deadlock hazard reported by Dmitry Dolgov. Instead
of the checkpoint trying to send itself a CKPT_REQUEST_SYN message
through the socket (whose buffer may be full), I included the
ckpt_started counter in all messages. When AbsorbAllFsyncRequests()
drains the socket, it stops at messages with the current ckpt_started
value.

3. Handle postmaster death while waiting.

4. I discovered that macOS would occasionally return EMSGSIZE for
sendmsg(), but treating that just like EAGAIN seems to work the next
time around. I couldn't make that happen on FreeBSD (I mention that
because the implementation is somehow related). So handle that weird
case on macOS only for now.

Testing on other Unixoid systems would be useful. The case that
produced occasional EMSGSIZE on macOS was: shared_buffers=1MB,
max_files_per_process=32, installcheck-parallel. Based on man pages
that seems to imply an error in the client code but I don't see it.

(I also tried to use SOCK_SEQPACKET instead of SOCK_STREAM, but it's
not supported on macOS. I also tried to use SOCK_DGRAM, but that
produced occasional ENOBUFS errors and retrying didn't immediately
succeed leading to busy syscall churn. This is all rather
unsatisfying, since SOCK_STREAM is not guaranteed by any standard to
be atomic, and we're writing messages from many backends into the
socket so we're assuming atomicity. I don't have a better idea that
is portable.)

There are a couple of FIXMEs remaining, and I am aware of three more problems:

* Andres mentioned to me off-list that there may be a deadlock risk
where the checkpointer gets stuck waiting for an IO lock. I'm going
to look into that.
* Windows. Patch soon.
* The ordering problem that I mentioned earlier: the patchset wants to
keep the *oldest* fd, but it's really the oldest it has received. An
idea Andres and I discussed is to use a shared atomic counter to
assign a number to all file descriptors just before their first write,
and send that along with it to the checkpointer. Patch soon.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Use-portable-close-on-exec-syscalls.patchapplication/octet-stream; name=0001-Use-portable-close-on-exec-syscalls.patchDownload
From 6f278a123caa395f0f487a2b04d7992e573a5fc6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 9 Aug 2018 15:38:17 +0530
Subject: [PATCH 1/4] Use portable close-on-exec syscalls.

---
 src/backend/postmaster/postmaster.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 135aa29bfeb..42134d4ed28 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -6458,7 +6458,7 @@ static void
 InitFsyncFdSocketPair(void)
 {
 	Assert(MyProcPid == PostmasterPid);
-	if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0)
+	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fsync_fds) < 0)
 		ereport(FATAL,
 				(errcode_for_file_access(),
 				 errmsg_internal("could not create fsync sockets: %m")));
@@ -6470,11 +6470,19 @@ InitFsyncFdSocketPair(void)
 		ereport(FATAL,
 				(errcode_for_socket_access(),
 				 errmsg_internal("could not set fsync process socket to nonblocking mode: %m")));
+	if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFD, FD_CLOEXEC) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set fsync process socket to close-on-exec mode: %m")));
 
 	if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFL, O_NONBLOCK) == -1)
 		ereport(FATAL,
 				(errcode_for_socket_access(),
 				 errmsg_internal("could not set fsync submit socket to nonblocking mode: %m")));
+	if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFD, FD_CLOEXEC) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set fsync submit socket to close-on-exec mode: %m")));
 
 	/*
 	 * FIXME: do DuplicateHandle dance for windows - can that work
-- 
2.17.0

0002-Fix-deadlock-in-AbsorbAllFsyncRequests.patchapplication/octet-stream; name=0002-Fix-deadlock-in-AbsorbAllFsyncRequests.patchDownload
From 806bee1efdde958bb3d819626e0cfaf624cf2055 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Thu, 9 Aug 2018 16:57:57 +0530
Subject: [PATCH 2/4] Fix deadlock in AbsorbAllFsyncRequests().

---
 src/backend/postmaster/checkpointer.c | 54 +++++++++++++--------------
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9ef56db97bc..6250cb21946 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -112,6 +112,7 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber segno;			/* see md.c for special values */
 	bool		contains_fd;
+	int			ckpt_started;
 	/* might add a real request-type field later; not needed yet */
 } CheckpointerRequest;
 
@@ -169,9 +170,6 @@ static double ckpt_cached_elapsed;
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
 
-static BlockNumber next_syn_rqst;
-static BlockNumber received_syn_rqst;
-
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
@@ -179,7 +177,7 @@ static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
 static void UpdateSharedMemoryConfig(void);
 static void SendFsyncRequest(CheckpointerRequest *request, int fd);
-static bool AbsorbFsyncRequest(void);
+static bool AbsorbFsyncRequest(bool stop_at_current_cycle);
 
 /* Signal handlers */
 
@@ -1105,6 +1103,11 @@ RequestCheckpoint(int flags)
  * is theoretically possible a backend fsync might still be necessary, if
  * the queue is full and contains no duplicate entries.  In that case, we
  * let the backend know by returning false.
+ *
+ * We add the cycle counter to the message.  That is an unsynchronized read
+ * of the shared memory counter, but it doesn't matter if it is arbitrarily
+ * old since it is only used to limit unnecessary extra queue draining in
+ * AbsorbAllFsyncRequests().
  */
 void
 ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno,
@@ -1124,6 +1127,15 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno,
 	request.segno = segno;
 	request.contains_fd = file != -1;
 
+	/*
+	 * We read ckpt_started without synchronization.  It is used to prevent
+	 * AbsorbAllFsyncRequests() from reading new values from after a
+	 * checkpoint began.  A slightly out-of-date value here will only cause
+	 * it to do a little bit more work than strictly necessary, but that's
+	 * OK.
+	 */
+	request.ckpt_started = CheckpointerShmem->ckpt_started;
+
 	SendFsyncRequest(&request, request.contains_fd ? FileGetRawDesc(file) : -1);
 }
 
@@ -1152,7 +1164,7 @@ AbsorbFsyncRequests(void)
 		if (!FlushFsyncRequestQueueIfNecessary())
 			break;
 
-		if (!AbsorbFsyncRequest())
+		if (!AbsorbFsyncRequest(false))
 			break;
 	}
 }
@@ -1170,8 +1182,6 @@ AbsorbFsyncRequests(void)
 void
 AbsorbAllFsyncRequests(void)
 {
-	CheckpointerRequest request = {0};
-
 	if (!AmCheckpointerProcess())
 		return;
 
@@ -1181,22 +1191,12 @@ AbsorbAllFsyncRequests(void)
 	BgWriterStats.m_buf_fsync_backend +=
 		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
 
-	/*
-	 * For mdsync()'s guarantees to work, all pending fsync requests need to
-	 * be executed. But we don't want to absorb requests till the queue is
-	 * empty, as that could take a long while.  So instead we enqueue
-	 */
-	request.type = CKPT_REQUEST_SYN;
-	request.segno = ++next_syn_rqst;
-	SendFsyncRequest(&request, -1);
-
-	received_syn_rqst = next_syn_rqst + 1;
-	while (received_syn_rqst != request.segno)
+	for (;;)
 	{
 		if (!FlushFsyncRequestQueueIfNecessary())
 			elog(FATAL, "may not happen");
 
-		if (!AbsorbFsyncRequest())
+		if (!AbsorbFsyncRequest(true))
 			break;
 	}
 }
@@ -1206,7 +1206,7 @@ AbsorbAllFsyncRequests(void)
  *		Retrieve one queued fsync request and pass them to local smgr.
  */
 static bool
-AbsorbFsyncRequest(void)
+AbsorbFsyncRequest(bool stop_at_current_cycle)
 {
 	CheckpointerRequest req;
 	int fd;
@@ -1229,17 +1229,13 @@ AbsorbFsyncRequest(void)
 		elog(FATAL, "message should have fd associated, but doesn't");
 	}
 
-	if (req.type == CKPT_REQUEST_SYN)
-	{
-		received_syn_rqst = req.segno;
-		Assert(fd == -1);
-	}
-	else
-	{
-		RememberFsyncRequest(req.rnode, req.forknum, req.segno, fd);
-	}
+	RememberFsyncRequest(req.rnode, req.forknum, req.segno, fd);
 	END_CRIT_SECTION();
 
+	if (stop_at_current_cycle &&
+		req.ckpt_started == CheckpointerShmem->ckpt_started)
+		return false;
+
 	return true;
 }
 
-- 
2.17.0

0003-Handle-postmaster-death-CFI-improve-error-messages-a.patchapplication/octet-stream; name=0003-Handle-postmaster-death-CFI-improve-error-messages-a.patchDownload
From 005bd50afb94e9876962edbb8d8d32a2843f9feb Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 10 Aug 2018 10:49:57 +0530
Subject: [PATCH 3/4] Handle postmaster death, CFI, improve error messages and
 comments.

---
 src/backend/postmaster/checkpointer.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 6250cb21946..16a57090fe7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1302,9 +1302,12 @@ static void
 SendFsyncRequest(CheckpointerRequest *request, int fd)
 {
 	ssize_t ret;
+	int		rc;
 
 	while (true)
 	{
+		CHECK_FOR_INTERRUPTS();
+
 		ret = pg_uds_send_with_fd(fsync_fds[FSYNC_FD_SUBMIT], request, sizeof(*request),
 								  request->contains_fd ? fd : -1);
 
@@ -1315,18 +1318,19 @@ SendFsyncRequest(CheckpointerRequest *request, int fd)
 			 * implementations, but better make sure that's true...
 			 */
 			if (ret != sizeof(*request))
-				elog(FATAL, "oops, gotta do better");
+				elog(FATAL, "unexpected short write to fsync request socket");
 			break;
 		}
 		else if (errno == EWOULDBLOCK || errno == EAGAIN)
 		{
 			/* blocked on write - wait for socket to become readable */
-			/* FIXME: postmaster death? Other interrupts? */
-			WaitLatchOrSocket(NULL, WL_SOCKET_WRITEABLE, fsync_fds[FSYNC_FD_SUBMIT], -1, 0);
+			rc = WaitLatchOrSocket(NULL,
+								   WL_SOCKET_WRITEABLE | WL_POSTMASTER_DEATH,
+								   fsync_fds[FSYNC_FD_SUBMIT], -1, 0);
+			if (rc & WL_POSTMASTER_DEATH)
+				exit(1);
 		}
 		else
-		{
 			ereport(FATAL, (errmsg("could not receive fsync request: %m")));
-		}
 	}
 }
-- 
2.17.0

0004-Handle-EMSGSIZE-on-macOS.-Fix-misleading-error-messa.patchapplication/octet-stream; name=0004-Handle-EMSGSIZE-on-macOS.-Fix-misleading-error-messa.patchDownload
From 5cc52ace3e31f9492fe6f2e4441f386c2b21837e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 10 Aug 2018 12:54:05 +0530
Subject: [PATCH 4/4] Handle EMSGSIZE on macOS.  Fix misleading error message.

Also zero-initialize a couple of structs passed to the kernel just in case
there is padding on some system somewhere that could be a problem (based on
a rumor about EMSGSIZE errors which didn't turn out to help).
---
 src/backend/postmaster/checkpointer.c | 15 +++++++++++++--
 src/backend/storage/file/fd.c         |  3 ++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 16a57090fe7..714f1522f15 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1321,8 +1321,19 @@ SendFsyncRequest(CheckpointerRequest *request, int fd)
 				elog(FATAL, "unexpected short write to fsync request socket");
 			break;
 		}
-		else if (errno == EWOULDBLOCK || errno == EAGAIN)
+		else if (errno == EWOULDBLOCK || errno == EAGAIN
+#ifdef __darwin__
+				 || errno == EMSGSIZE || errno == ENOBUFS
+#endif
+				)
 		{
+			/*
+			 * Testing on macOS 10.13 showed occasional EMSGSIZE or
+			 * ENOBUFS errors, which could be handled by retrying.  Unless
+			 * the problem also shows up on other systems, let's handle those
+			 * only for that OS.
+			 */
+
 			/* blocked on write - wait for socket to become readable */
 			rc = WaitLatchOrSocket(NULL,
 								   WL_SOCKET_WRITEABLE | WL_POSTMASTER_DEATH,
@@ -1331,6 +1342,6 @@ SendFsyncRequest(CheckpointerRequest *request, int fd)
 				exit(1);
 		}
 		else
-			ereport(FATAL, (errmsg("could not receive fsync request: %m")));
+			ereport(FATAL, (errmsg("could not send fsync request: %m")));
 	}
 }
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7135a57df57..d45e15a9e41 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3642,7 +3642,7 @@ pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd)
 {
 	ssize_t     size;
 	struct msghdr   msg = {0};
-	struct iovec    iov;
+	struct iovec    iov = {0};
 	/* cmsg header, union for correct alignment */
 	union
 	{
@@ -3651,6 +3651,7 @@ pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd)
 	} cmsgu;
 	struct cmsghdr  *cmsg;
 
+	memset(&cmsgu, 0, sizeof(cmsgu));
 	iov.iov_base = buf;
 	iov.iov_len = buflen;
 
-- 
2.17.0
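The `pg_uds_send_with_fd()` routine patched above passes a file descriptor to the checkpointer via SCM_RIGHTS ancillary data on a Unix domain socket. A minimal illustrative sketch of the same mechanism (helper names are made up here, not PostgreSQL code):

```python
import array
import os
import socket
import tempfile

def send_with_fd(sock, payload, fd):
    """Send payload plus one file descriptor (cf. pg_uds_send_with_fd)."""
    fds = array.array("i", [fd])
    return sock.sendmsg([payload],
                        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds.tobytes())])

def recv_with_fd(sock, buflen):
    """Receive payload and the SCM_RIGHTS descriptor travelling with it."""
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(buflen,
                                             socket.CMSG_SPACE(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return msg, (fds[0] if fds else -1)

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
with tempfile.TemporaryFile() as f:
    f.write(b"dirty page")
    f.flush()
    send_with_fd(a, b"fsync this", f.fileno())
    msg, fd = recv_with_fd(b, 1024)
    data = os.pread(fd, 10, 0)   # same open file description as the sender's
    os.fsync(fd)                 # the checkpointer would fsync the received fd
    os.close(fd)
a.close()
b.close()
```

Because the received descriptor refers to the same open file description, an fsync through it sees any writeback error the sender's writes provoked, which is the whole point of handing fds to the checkpointer.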

#58Asim R P
apraveen@pivotal.io
In reply to: Thomas Munro (#57)
Re: Postgres, fsync, and OSs (specifically linux)

I was looking at the commitfest entry for this feature
(https://commitfest.postgresql.org/19/1639/) for the most recent list
of patches to try out. The list doesn't look correct/complete. Can
someone please check?

Asim

#59Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Asim R P (#58)
Re: Postgres, fsync, and OSs (specifically linux)

On Wed, Aug 15, 2018 at 11:08 AM, Asim R P <apraveen@pivotal.io> wrote:

I was looking at the commitfest entry for feature
(https://commitfest.postgresql.org/19/1639/) for the most recent list
of patches to try out. The list doesn't look correct/complete. Can
someone please check?

Hi Asim,

This thread is a bit tangled up. There are two related patchsets in it:

1. Craig Ringer's PANIC-on-EIO patch set, to cope with the fact that
Linux throws away buffers and errors after reporting an error, so the
checkpointer shouldn't retry as it does today. The latest is here:

/messages/by-id/CAMsr+YFPeKVaQ57PwHqmRNjPCPABsdbV=L85he2dVBcr6yS1mA@mail.gmail.com

2. Andres Freund's fd-sending fsync queue, to cope with the fact that
some versions of Linux only report writeback errors that occurred
after you opened the file, and all versions of Linux and some other
operating systems might forget about writeback errors while no one has
it open.
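The behaviour the first patch set aims for can be sketched in a few lines: treat a failed fsync() as fatal instead of retrying, because the kernel marks the pages clean after reporting the error and a second fsync() would falsely succeed (exception and helper names below are hypothetical, not PostgreSQL code):

```python
import os
import tempfile

class FsyncPanic(Exception):
    """Hypothetical stand-in for PostgreSQL's PANIC: after a failed fsync
    the kernel may already have dropped the dirty pages and cleared the
    error, so a retry could report a success that never reached disk.
    The only safe recovery is crash recovery from the WAL."""

def checkpoint_fsync(fd):
    try:
        os.fsync(fd)
        return True
    except OSError as e:
        # Do NOT loop and retry here; escalate instead.
        raise FsyncPanic("could not fsync file: %s" % e) from e

with tempfile.TemporaryFile() as f:
    f.write(b"committed data")
    result = checkpoint_fsync(f.fileno())  # succeeds on a healthy filesystem
```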

Here is the original patchset:

/messages/by-id/20180522010823.z5bdq7wnlsna5qoo@alap3.anarazel.de

Here is a fix-up you need:

/messages/by-id/20180522185951.5sdudzl46spktyyz@alap3.anarazel.de

Here are some more fix-up patches that I propose:

/messages/by-id/CAEepm=2WSPP03-20XHpxohSd2UyG_dvw5zWS1v7Eas8Rd=5e4A@mail.gmail.com

I will soon post some more fix-up patches that add EXEC_BACKEND
support, Windows support, and a counting scheme to fix the timing
issue that I mentioned in my first review. I will probably squash it
all down to a tidy patch-set after that.

--
Thomas Munro
http://www.enterprisedb.com

#60Craig Ringer
craig@2ndquadrant.com
In reply to: Thomas Munro (#59)
Re: Postgres, fsync, and OSs (specifically linux)

On 15 August 2018 at 07:32, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

On Wed, Aug 15, 2018 at 11:08 AM, Asim R P <apraveen@pivotal.io> wrote:

I was looking at the commitfest entry for feature
(https://commitfest.postgresql.org/19/1639/) for the most recent list
of patches to try out. The list doesn't look correct/complete. Can
someone please check?

Hi Asim,

This thread is a bit tangled up. There are two related patchsets in it:

1. Craig Ringer's PANIC-on-EIO patch set, to cope with the fact that
Linux throws away buffers and errors after reporting an error, so the
checkpointer shouldn't retry as it does today. The latest is here:

/messages/by-id/CAMsr+YFPeKVaQ57PwHqmRNjPCPABsdbV=L85he2dVBcr6yS1mA@mail.gmail.com

2. Andres Freund's fd-sending fsync queue, to cope with the fact that
some versions of Linux only report writeback errors that occurred
after you opened the file, and all versions of Linux and some other
operating systems might forget about writeback errors while no one has
it open.

Here is the original patchset:

/messages/by-id/20180522010823.z5bdq7wnlsna5qoo@alap3.anarazel.de

Here is a fix-up you need:

/messages/by-id/20180522185951.5sdudzl46spktyyz@alap3.anarazel.de

Here are some more fix-up patches that I propose:

/messages/by-id/CAEepm=2WSPP03-20XHpxohSd2UyG_dvw5zWS1v7Eas8Rd=5e4A@mail.gmail.com

I will soon post some more fix-up patches that add EXEC_BACKEND
support, Windows support, and a counting scheme to fix the timing
issue that I mentioned in my first review. I will probably squash it
all down to a tidy patch-set after that.

Thanks very much, Thomas.

I've had to back off from this a bit after posting my initial
panic-for-safety patch, as the changes Andres proposed are a bit out of my
current depth and time capacity.

I still think the panic patch is needed and appropriate, but agree it's not
*sufficient*.

#61Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Craig Ringer (#60)
1 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

On Thu, Aug 30, 2018 at 2:44 PM Craig Ringer <craig@2ndquadrant.com> wrote:

On 15 August 2018 at 07:32, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

I will soon post some more fix-up patches that add EXEC_BACKEND
support, Windows support, and a counting scheme to fix the timing
issue that I mentioned in my first review. I will probably squash it
all down to a tidy patch-set after that.

I went down a bit of a rabbit hole with the Windows support for
Andres's patch set. I have something that works as far as I can tell,
but my Windows environment consists of throwing things at Appveyor and
seeing what sticks, so I'm hoping that someone with a real Windows
system and knowledge will be able to comment.

New patches in this WIP patch set:

0012: Fix for EXEC_BACKEND.

0013: Windows. This involved teaching latch.c to deal with Windows
asynchronous IO events, since you can't wait for pipe readiness via
WSAEventSelect. Pipes and sockets exist in different dimensions on
Windows, and there are no "Unix" domain sockets (well, there are but
they aren't usable yet[1]). An alternative would be to use TCP
sockets for this, and then the code would look more like the Unix
code, but that seems a bit strange. Note that the Windows version
doesn't actually hand off file handles like the Unix code (it could
fairly easily, but there is no reason to think that would actually be
useful on that platform). I may be way off here...

The 0013 patch also fixes a mistake in the 0010 patch: it is not
appropriate to call CFI() while waiting to notify the checkpointer of
a dirty segment, because then ^C could cause the following checkpoint
not to flush dirty data. SendFsyncRequest() is essentially blocking,
except that it uses non-blocking IO so that it can multiplex postmaster
death detection.
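The pattern of a logically blocking send implemented over a non-blocking socket, so the sender can simultaneously watch for postmaster death, can be sketched like this (illustrative only; the real code waits via the latch machinery, and the names here are made up):

```python
import os
import select
import socket

def send_blocking(sock, payload, death_r):
    """Send all of payload on a non-blocking socket, retrying on EAGAIN,
    while also watching a pipe fd (death_r) that would become readable
    on postmaster death.  Toy stand-in for SendFsyncRequest()'s loop."""
    view = memoryview(payload)
    while view:
        try:
            n = sock.send(view)
            view = view[n:]
        except BlockingIOError:
            # Socket buffer full: wait until writable, or until the
            # death pipe signals that we should exit instead.
            r, w, _ = select.select([death_r], [sock], [])
            if death_r in r:
                raise SystemExit(1)
    return len(payload)

a, b = socket.socketpair()
a.setblocking(False)
death_r, death_w = os.pipe()
sent = send_blocking(a, b"request", death_r)
received = b.recv(1024)
os.close(death_r)
os.close(death_w)
a.close()
b.close()
```

Note that no cancellation check appears inside the loop, matching the point above: once a dirty-segment notification has been started, it must not be abandoned part-way.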

0014: Fix the ordering race condition mentioned upthread[2]. All
files are assigned an increasing sequence number after [re]opening (ie
before their first write), so that the checkpointer process can track
the fd that must have the oldest Linux f_wb_err that could be relevant
for writes done by PostgreSQL.
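The counting scheme in 0014 can be modelled in a few lines (all names below are hypothetical): every [re]open of a file is stamped with a globally increasing sequence number before its first write, and the checkpointer keeps, per file, the descriptor with the lowest number, since that one was opened earliest and therefore cannot have missed a writeback error relevant to any later write:

```python
import itertools

_open_seq = itertools.count(1)

class FsyncFdCache:
    """Toy model of the checkpointer's per-file fd tracking."""

    def __init__(self):
        self.oldest = {}  # path -> (sequence, fd)

    def remember(self, path, fd):
        # Stamped at [re]open time, before the first write through this fd.
        seq = next(_open_seq)
        kept = self.oldest.get(path)
        if kept is None or seq < kept[0]:
            # Keep the descriptor with the oldest f_wb_err snapshot;
            # a newer duplicate for the same file can be closed.
            self.oldest[path] = (seq, fd)
        return seq

cache = FsyncFdCache()
first = cache.remember("base/1/16384", fd=10)  # opened before any write
later = cache.remember("base/1/16384", fd=11)  # reopened afterwards
kept_fd = cache.oldest["base/1/16384"][1]      # the older descriptor wins
```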

The other patches in this tarball are all as posted already, but are
now rebased and assembled in one place. Also pushed to
https://github.com/macdice/postgres/tree/fsyncgate .

Thoughts?

[1]: https://blogs.msdn.microsoft.com/commandline/2017/12/19/af_unix-comes-to-windows/
[2]: /messages/by-id/CAEepm=04ZCG_8N3m61kXZP-7Ecr02HUNNG-QsAhwyFLim4su2g@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fsyncgate-v3.tgz (application/x-gzip)
!����#��3!�5i��g5^OF��'g8��!0'@�S".p����0��svL�q���,I�+�W���M����*�6���g.}���5w�j����<c*��	�*$��_�z����.|1<�l���Q��~�5NH���r�ab�x�����wrP>	�� &�/�=��	<���<��90\��7��1��\�`>\�Zo�_�W�&�z	j'n"f�OTSJ������aHS��;(q��w��X��j��KZSF��iQdY16DAM����#b��G'���U�1Bv�����M�':�cc>�s]�r?FW_+�K�-~F����b��%Z�A�0#��>uJ/-�.u�<+1����SB��m9Aw��%Q�"%��1���u���.4}�Z��Ba�G$t/��4~/�El3��W.��=���O*�sN�,h��(	�ky'�Hw&�)�p��T��,������N��F^�:��)���w.�\���#��a
�`qA���H7�+���h�������No'�������_�t*��I#�F��Y�����TMKC8��'"g�0Ku��a�.�|�O�rt�	����<��;�Q��z�<��
�5R,V����X�������e�)u�V]�������cn�7���I��t��������[A�
��E`��UEh6y�8��[�a���)@��u����,E
E��b����P���`+PF�p{9EL���Y������8�Y/y��}�|�F�I���]�Lr�&�i��(�	�x��9���E�<{!tA��l��0�S���J�	�:#-��}��??���wR��B�md�
!��	������q X�r)�1��so����K��Lg��4 �Cy`���m�U������ry���j}Z��=� ��
Ux!J�����e��������
�A�%vn>�o���)���� `��X7 ����P��$]�	�����NR�`�*��d?�������#:��b
����lT�����5�`�-�i���c������?%G�rHn��w��/q��z���3D\��&�~�c/���L��$}�\6��,��rp��z��j\����(:�H�.t��d
����rM���M2�Kl�GO�<�x������xo�����B�����YV��P��UC!4[g��N>v��tv^�G��w	��x��t�JC��^F����B~���u_�y�����9k��q�����k�������%f����Sk������������[����Eh���%�9����pQ��JCJ1s��A`w^�Z�t�l}Id���d���1�J�"-�l�R���~y��A�~%��MI��c��a�':>�o��g�=E�A�(��"�,�)������d��7�\#���.k`y	B�45�p����J<�@������������e��K>��6�vA��PP�[*��e�L�]r'���i��T�hJ���>a�p:A�#��(��
!�� �}O��Hv	2���=������k1�����k�3�Q����O!��+c���H�c���'���{�fcl�k����vv�^E� ah�������)M�@*���D
��0���wvqie�a��w��U6L�}o�|3t{�����C�����"F��w4 �m��L[A��_��!�J2�����w}����Rt�.@mtG�C��@�<`D�z��Y~VEKd���g��$Y-%�B����E$R
�n$6GZzR�i'<5t��j�vv�>v���4�$�Ykr�O�=x�;�G�3�A�.&s�$��~�M������g��,�#�`ji�����U�A�0�G��dE���=[u�~vB7 ��e��h�m&�
X��Mcrlb��&��T�:g�
����z�{���
i�n�����
�O|��^��{�+�u����4}�;��%������6��31��]${�4dZ��O��xH� pT(����1�������tq��b�4$���7���;	u�>�fJnF��'�\���$��������1.��.��P��|���������d���*�w��������G9SPH���z�+C��������}]�z��a8�G#w�c%�j6�^�`O�-ZV�)k]�a��v�����(s
�F�w��p�Q����]���A��?a��,\���,�6^���z��hV��R9��f�Y�.���f�2�=��9Y��i@�	�y&�--Q��"{9`�^��&�=kF�c��=���+��~G�t�u��U$���U��	7hT{O�O��������{�=��~���qtx��
#62Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#61)
Re: Postgres, fsync, and OSs (specifically linux)

On Fri, Sep 28, 2018 at 9:37 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

The 0013 patch also fixes a mistake in the 0010 patch: it is not
appropriate to call CFI() while waiting to notify the checkpointer of
a dirty segment, because then ^C could cause the following checkpoint
not to flush dirty data.

(Though of course it wouldn't actually do that, due to an LWLock being
held; still, I removed the CFI because it was at best misleading.)

--
Thomas Munro
http://www.enterprisedb.com

#63Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#61)
2 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

On Fri, Sep 28, 2018 at 9:37 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

The other patches in this tarball are all as posted already, but are
now rebased and assembled in one place. Also pushed to
https://github.com/macdice/postgres/tree/fsyncgate .

Here is a new version that fixes an assertion failure during crash
recovery, revealed by cfbot. I also took the liberty of squashing the
patch stack into one and writing a new commit message, except the
Windows part which seems worth keeping separate until we agree it's
the right way forward.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Keep-file-descriptors-open-to-avoid-losing-errors-v4.patch (application/octet-stream)
From 4bbda6201d63559794794dd3d851a4da34bebe48 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 21 May 2018 15:43:30 -0700
Subject: [PATCH 1/3] Keep file descriptors open to avoid losing errors.

Craig Ringer reported that our practice of closing files and then
reopening them in the checkpointer and calling fsync() could lose
track of write-back errors, on Linux.

Change to a model where file descriptors are sent to the checkpointer
via the ancillary data mechanism of Unix domain sockets, and the
oldest file descriptor is kept open so that the write-back errors
cannot be lost.

Not yet done in this commit:  Craig also reported that after fsync()
returns an error, retrying does not report success reliably.  A
follow-up patch will handle that.

Not yet done in this commit:  Windows support.

Author: Andres Freund, with contributions by Thomas Munro
Reviewed-by: Thomas Munro, Dmitry Dolgov
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
Discussion: https://postgr.es/m/CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com
---
 src/backend/access/transam/xlog.c         |   7 +-
 src/backend/postmaster/checkpointer.c     | 382 +++++++-------
 src/backend/postmaster/postmaster.c       |  53 ++
 src/backend/storage/file/fd.c             | 213 +++++++-
 src/backend/storage/freespace/freespace.c |   5 +-
 src/backend/storage/ipc/ipci.c            |   1 +
 src/backend/storage/smgr/md.c             | 595 ++++++++++++++--------
 src/include/postmaster/bgwriter.h         |  11 +-
 src/include/postmaster/postmaster.h       |   5 +
 src/include/storage/fd.h                  |   9 +
 src/include/storage/smgr.h                |   3 +-
 11 files changed, 853 insertions(+), 431 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7375a78ffcf..775da6b2afd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8777,8 +8777,10 @@ CreateCheckPoint(int flags)
 	 * Note: because it is possible for log_checkpoints to change while a
 	 * checkpoint proceeds, we always accumulate stats, even if
 	 * log_checkpoints is currently off.
+	 *
+	 * Note #2: this is reset at the end of the checkpoint, not here, because
+	 * we might have to fsync before getting here (see mdsync()).
 	 */
-	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
 	/*
@@ -9141,6 +9143,9 @@ CreateCheckPoint(int flags)
 									 CheckpointStats.ckpt_segs_recycled);
 
 	LWLockRelease(CheckpointLock);
+
+	/* reset stats */
+	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 }
 
 /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a033093c53..645a5a59e0c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,7 +46,9 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "port/atomics.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/postmaster.h"
 #include "replication/syncrep.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
@@ -101,19 +103,23 @@
  *
  * The requests array holds fsync requests sent by backends and not yet
  * absorbed by the checkpointer.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
  *----------
  */
 typedef struct
 {
+	uint32		type;
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber segno;			/* see md.c for special values */
+	bool		contains_fd;
+	int			ckpt_started;
+	uint64		open_seq;
 	/* might add a real request-type field later; not needed yet */
 } CheckpointerRequest;
 
+#define CKPT_REQUEST_RNODE			1
+#define CKPT_REQUEST_SYN			2
+
 typedef struct
 {
 	pid_t		checkpointer_pid;	/* PID (0 if not started) */
@@ -126,11 +132,10 @@ typedef struct
 
 	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
 
-	uint32		num_backend_writes; /* counts user backend buffer writes */
-	uint32		num_backend_fsync;	/* counts user backend fsync calls */
+	pg_atomic_uint32 num_backend_writes; /* counts user backend buffer writes */
+	pg_atomic_uint32 num_backend_fsync;	/* counts user backend fsync calls */
+	pg_atomic_uint32 ckpt_cycle; /* cycle */
 
-	int			num_requests;	/* current # of requests */
-	int			max_requests;	/* allocated array size */
 	CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
 } CheckpointerShmemStruct;
 
@@ -171,8 +176,9 @@ static pg_time_t last_xlog_switch_time;
 static void CheckArchiveTimeout(void);
 static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
-static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
+static void SendFsyncRequest(CheckpointerRequest *request, int fd);
+static bool AbsorbFsyncRequest(bool stop_at_current_cycle);
 
 /* Signal handlers */
 
@@ -545,10 +551,11 @@ CheckpointerMain(void)
 			cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
 		}
 
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-					   cur_timeout * 1000L /* convert to ms */ ,
-					   WAIT_EVENT_CHECKPOINTER_MAIN);
+		rc = WaitLatchOrSocket(MyLatch,
+							   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
+							   fsync_fds[FSYNC_FD_PROCESS],
+							   cur_timeout * 1000L /* convert to ms */ ,
+							   WAIT_EVENT_CHECKPOINTER_MAIN);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
@@ -892,12 +899,7 @@ CheckpointerShmemSize(void)
 {
 	Size		size;
 
-	/*
-	 * Currently, the size of the requests[] array is arbitrarily set equal to
-	 * NBuffers.  This may prove too large or small ...
-	 */
 	size = offsetof(CheckpointerShmemStruct, requests);
-	size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
 
 	return size;
 }
@@ -920,13 +922,13 @@ CheckpointerShmemInit(void)
 	if (!found)
 	{
 		/*
-		 * First time through, so initialize.  Note that we zero the whole
-		 * requests array; this is so that CompactCheckpointerRequestQueue can
-		 * assume that any pad bytes in the request structs are zeroes.
+		 * First time through, so initialize.
 		 */
 		MemSet(CheckpointerShmem, 0, size);
 		SpinLockInit(&CheckpointerShmem->ckpt_lck);
-		CheckpointerShmem->max_requests = NBuffers;
+		pg_atomic_init_u32(&CheckpointerShmem->ckpt_cycle, 0);
+		pg_atomic_init_u32(&CheckpointerShmem->num_backend_writes, 0);
+		pg_atomic_init_u32(&CheckpointerShmem->num_backend_fsync, 0);
 	}
 }
 
@@ -1102,181 +1104,82 @@ RequestCheckpoint(int flags)
  * is theoretically possible a backend fsync might still be necessary, if
  * the queue is full and contains no duplicate entries.  In that case, we
  * let the backend know by returning false.
+ *
+ * We add the cycle counter to the message.  That is an unsynchronized read
+ * of the shared memory counter, but it doesn't matter if it is arbitrarily
+ * old since it is only used to limit unnecessary extra queue draining in
+ * AbsorbAllFsyncRequests().
  */
-bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+void
+ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno,
+					File file)
 {
-	CheckpointerRequest *request;
-	bool		too_full;
+	CheckpointerRequest request = {0};
 
 	if (!IsUnderPostmaster)
-		return false;			/* probably shouldn't even get here */
+		elog(ERROR, "ForwardFsyncRequest must not be called in single user mode");
 
 	if (AmCheckpointerProcess())
 		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
 
-	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
-	/* Count all backend writes regardless of if they fit in the queue */
-	if (!AmBackgroundWriterProcess())
-		CheckpointerShmem->num_backend_writes++;
+	request.type = CKPT_REQUEST_RNODE;
+	request.rnode = rnode;
+	request.forknum = forknum;
+	request.segno = segno;
+	request.contains_fd = file != -1;
 
 	/*
-	 * If the checkpointer isn't running or the request queue is full, the
-	 * backend will have to perform its own fsync request.  But before forcing
-	 * that to happen, we can try to compact the request queue.
+	 * Tell the checkpointer the sequence number of the most recent open, so
+	 * that it can be sure to hold the older file descriptor.
 	 */
-	if (CheckpointerShmem->checkpointer_pid == 0 ||
-		(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
-		 !CompactCheckpointerRequestQueue()))
-	{
-		/*
-		 * Count the subset of writes where backends have to do their own
-		 * fsync
-		 */
-		if (!AmBackgroundWriterProcess())
-			CheckpointerShmem->num_backend_fsync++;
-		LWLockRelease(CheckpointerCommLock);
-		return false;
-	}
+	request.open_seq = request.contains_fd ? FileGetOpenSeq(file) : (uint64) -1;
 
-	/* OK, insert request */
-	request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
-	request->rnode = rnode;
-	request->forknum = forknum;
-	request->segno = segno;
-
-	/* If queue is more than half full, nudge the checkpointer to empty it */
-	too_full = (CheckpointerShmem->num_requests >=
-				CheckpointerShmem->max_requests / 2);
-
-	LWLockRelease(CheckpointerCommLock);
-
-	/* ... but not till after we release the lock */
-	if (too_full && ProcGlobal->checkpointerLatch)
-		SetLatch(ProcGlobal->checkpointerLatch);
+	/*
+	 * We read ckpt_started without synchronization.  It is used to prevent
+	 * AbsorbAllFsyncRequests() from reading new values from after a
+	 * checkpoint began.  A slightly out-of-date value here will only cause
+	 * it to do a little bit more work than strictly necessary, but that's
+	 * OK.
+	 */
+	request.ckpt_started = CheckpointerShmem->ckpt_started;
 
-	return true;
+	SendFsyncRequest(&request, request.contains_fd ? FileGetRawDesc(file) : -1);
 }
 
 /*
- * CompactCheckpointerRequestQueue
- *		Remove duplicates from the request queue to avoid backend fsyncs.
- *		Returns "true" if any entries were removed.
- *
- * Although a full fsync request queue is not common, it can lead to severe
- * performance problems when it does happen.  So far, this situation has
- * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint.  Without this
- * logic, each backend begins doing an fsync for every block written, which
- * gets very expensive and can slow down the whole system.
+ * AbsorbFsyncRequests
+ *		Retrieve queued fsync requests and pass them to local smgr. Stop when
+ *		resources would be exhausted by absorbing more.
  *
- * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries.  But that should be vanishingly rare in
- * practice: there's one queue entry per shared buffer.
+ * This is exported because we want to continue accepting requests during
+ * mdsync().
  */
-static bool
-CompactCheckpointerRequestQueue(void)
+void
+AbsorbFsyncRequests(void)
 {
-	struct CheckpointerSlotMapping
-	{
-		CheckpointerRequest request;
-		int			slot;
-	};
-
-	int			n,
-				preserve_count;
-	int			num_skipped = 0;
-	HASHCTL		ctl;
-	HTAB	   *htab;
-	bool	   *skip_slot;
-
-	/* must hold CheckpointerCommLock in exclusive mode */
-	Assert(LWLockHeldByMe(CheckpointerCommLock));
-
-	/* Initialize skip_slot array */
-	skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
-
-	/* Initialize temporary hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
-	ctl.keysize = sizeof(CheckpointerRequest);
-	ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
-	ctl.hcxt = CurrentMemoryContext;
-
-	htab = hash_create("CompactCheckpointerRequestQueue",
-					   CheckpointerShmem->num_requests,
-					   &ctl,
-					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-	/*
-	 * The basic idea here is that a request can be skipped if it's followed
-	 * by a later, identical request.  It might seem more sensible to work
-	 * backwards from the end of the queue and check whether a request is
-	 * *preceded* by an earlier, identical request, in the hopes of doing less
-	 * copying.  But that might change the semantics, if there's an
-	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
-	 * we do it this way.  It would be possible to be even smarter if we made
-	 * the code below understand the specific semantics of such requests (it
-	 * could blow away preceding entries that would end up being canceled
-	 * anyhow), but it's not clear that the extra complexity would buy us
-	 * anything.
-	 */
-	for (n = 0; n < CheckpointerShmem->num_requests; n++)
-	{
-		CheckpointerRequest *request;
-		struct CheckpointerSlotMapping *slotmap;
-		bool		found;
-
-		/*
-		 * We use the request struct directly as a hashtable key.  This
-		 * assumes that any padding bytes in the structs are consistently the
-		 * same, which should be okay because we zeroed them in
-		 * CheckpointerShmemInit.  Note also that RelFileNode had better
-		 * contain no pad bytes.
-		 */
-		request = &CheckpointerShmem->requests[n];
-		slotmap = hash_search(htab, request, HASH_ENTER, &found);
-		if (found)
-		{
-			/* Duplicate, so mark the previous occurrence as skippable */
-			skip_slot[slotmap->slot] = true;
-			num_skipped++;
-		}
-		/* Remember slot containing latest occurrence of this request value */
-		slotmap->slot = n;
-	}
+	if (!AmCheckpointerProcess())
+		return;
 
-	/* Done with the hash table. */
-	hash_destroy(htab);
+	/* Transfer stats counts into pending pgstats message */
+	BgWriterStats.m_buf_written_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+	BgWriterStats.m_buf_fsync_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
 
-	/* If no duplicates, we're out of luck. */
-	if (!num_skipped)
+	while (true)
 	{
-		pfree(skip_slot);
-		return false;
-	}
+		if (!FlushFsyncRequestQueueIfNecessary())
+			break;
 
-	/* We found some duplicates; remove them. */
-	preserve_count = 0;
-	for (n = 0; n < CheckpointerShmem->num_requests; n++)
-	{
-		if (skip_slot[n])
-			continue;
-		CheckpointerShmem->requests[preserve_count++] = CheckpointerShmem->requests[n];
+		if (!AbsorbFsyncRequest(false))
+			break;
 	}
-	ereport(DEBUG1,
-			(errmsg("compacted fsync request queue from %d entries to %d entries",
-					CheckpointerShmem->num_requests, preserve_count)));
-	CheckpointerShmem->num_requests = preserve_count;
-
-	/* Cleanup. */
-	pfree(skip_slot);
-	return true;
 }
 
 /*
- * AbsorbFsyncRequests
- *		Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbAllFsyncRequests
+ *		Retrieve all already pending fsync requests and pass them to local
+ *		smgr.
  *
  * This is exported because it must be called during CreateCheckPoint;
  * we have to be sure we have accepted all pending requests just before
@@ -1284,54 +1187,63 @@ CompactCheckpointerRequestQueue(void)
  * non-checkpointer processes, do nothing if not checkpointer.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbAllFsyncRequests(void)
 {
-	CheckpointerRequest *requests = NULL;
-	CheckpointerRequest *request;
-	int			n;
-
 	if (!AmCheckpointerProcess())
 		return;
 
-	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
 	/* Transfer stats counts into pending pgstats message */
-	BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-	BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
-
-	CheckpointerShmem->num_backend_writes = 0;
-	CheckpointerShmem->num_backend_fsync = 0;
+	BgWriterStats.m_buf_written_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+	BgWriterStats.m_buf_fsync_backend +=
+		pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
 
-	/*
-	 * We try to avoid holding the lock for a long time by copying the request
-	 * array, and processing the requests after releasing the lock.
-	 *
-	 * Once we have cleared the requests from shared memory, we have to PANIC
-	 * if we then fail to absorb them (eg, because our hashtable runs out of
-	 * memory).  This is because the system cannot run safely if we are unable
-	 * to fsync what we have been told to fsync.  Fortunately, the hashtable
-	 * is so small that the problem is quite unlikely to arise in practice.
-	 */
-	n = CheckpointerShmem->num_requests;
-	if (n > 0)
+	for (;;)
 	{
-		requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
-		memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
+		if (!FlushFsyncRequestQueueIfNecessary())
+			elog(FATAL, "could not flush fsync request queue");
+
+		if (!AbsorbFsyncRequest(true))
+			break;
 	}
+}
 
-	START_CRIT_SECTION();
+/*
+ * AbsorbFsyncRequest
+ *		Retrieve one queued fsync request and pass it to local smgr.
+ */
+static bool
+AbsorbFsyncRequest(bool stop_at_current_cycle)
+{
+	CheckpointerRequest req;
+	int fd;
+	int ret;
 
-	CheckpointerShmem->num_requests = 0;
+	ReleaseLruFiles();
 
-	LWLockRelease(CheckpointerCommLock);
+	START_CRIT_SECTION();
+	ret = pg_uds_recv_with_fd(fsync_fds[FSYNC_FD_PROCESS], &req, sizeof(req), &fd);
+	if (ret < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
+	{
+		END_CRIT_SECTION();
+		return false;
+	}
+	else if (ret < 0)
+		elog(ERROR, "recvmsg failed: %m");
 
-	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+	if (req.contains_fd != (fd != -1))
+	{
+		elog(FATAL, "message should have fd associated, but doesn't");
+	}
 
+	RememberFsyncRequest(req.rnode, req.forknum, req.segno, fd, req.open_seq);
 	END_CRIT_SECTION();
 
-	if (requests)
-		pfree(requests);
+	if (stop_at_current_cycle &&
+		req.ckpt_started == CheckpointerShmem->ckpt_started)
+		return false;
+
+	return true;
 }
 
 /*
@@ -1374,3 +1286,69 @@ FirstCallSinceLastCheckpoint(void)
 
 	return FirstCall;
 }
+
+uint32
+GetCheckpointSyncCycle(void)
+{
+	return pg_atomic_read_u32(&CheckpointerShmem->ckpt_cycle);
+}
+
+uint32
+IncCheckpointSyncCycle(void)
+{
+	return pg_atomic_fetch_add_u32(&CheckpointerShmem->ckpt_cycle, 1);
+}
+
+void
+CountBackendWrite(void)
+{
+	pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_writes, 1);
+}
+
+static void
+SendFsyncRequest(CheckpointerRequest *request, int fd)
+{
+	ssize_t ret;
+	int		rc;
+
+	while (true)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		ret = pg_uds_send_with_fd(fsync_fds[FSYNC_FD_SUBMIT], request, sizeof(*request),
+								  request->contains_fd ? fd : -1);
+
+		if (ret >= 0)
+		{
+			/*
+			 * Don't think short writes will ever happen in realistic
+			 * implementations, but better make sure that's true...
+			 */
+			if (ret != sizeof(*request))
+				elog(FATAL, "unexpected short write to fsync request socket");
+			break;
+		}
+		else if (errno == EWOULDBLOCK || errno == EAGAIN
+#ifdef __darwin__
+				 || errno == EMSGSIZE || errno == ENOBUFS
+#endif
+				)
+		{
+			/*
+			 * Testing on macOS 10.13 showed occasional EMSGSIZE or
+			 * ENOBUFS errors, which could be handled by retrying.  Unless
+			 * the problem also shows up on other systems, let's handle those
+			 * only for that OS.
+			 */
+
+			/* blocked on write - wait for socket to become readable */
+			rc = WaitLatchOrSocket(NULL,
+								   WL_SOCKET_WRITEABLE | WL_POSTMASTER_DEATH,
+								   fsync_fds[FSYNC_FD_SUBMIT], -1, 0);
+			if (rc & WL_POSTMASTER_DEATH)
+				exit(1);
+		}
+		else
+			ereport(FATAL, (errmsg("could not send fsync request: %m")));
+	}
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 41de140ae01..5631a09fcb8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -70,6 +70,7 @@
 #include <time.h>
 #include <sys/wait.h>
 #include <ctype.h>
+#include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/socket.h>
 #include <fcntl.h>
@@ -434,6 +435,7 @@ static pid_t StartChildProcess(AuxProcType type);
 static void StartAutovacuumWorker(void);
 static void MaybeStartWalReceiver(void);
 static void InitPostmasterDeathWatchHandle(void);
+static void InitFsyncFdSocketPair(void);
 
 /*
  * Archiver is allowed to start up at the current postmaster state?
@@ -526,6 +528,7 @@ typedef struct
 #else
 	int			postmaster_alive_fds[2];
 	int			syslogPipe[2];
+	int			fsync_fds[2];
 #endif
 	char		my_exec_path[MAXPGPATH];
 	char		pkglib_path[MAXPGPATH];
@@ -568,6 +571,8 @@ int			postmaster_alive_fds[2] = {-1, -1};
 HANDLE		PostmasterHandle;
 #endif
 
+int			fsync_fds[2] = {-1, -1};
+
 /*
  * Postmaster main entry point
  */
@@ -1195,6 +1200,11 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	InitPostmasterDeathWatchHandle();
 
+	/*
+	 * Initialize socket pair used to transport file descriptors over.
+	 */
+	InitFsyncFdSocketPair();
+
 #ifdef WIN32
 
 	/*
@@ -6063,6 +6073,7 @@ save_backend_variables(BackendParameters *param, Port *port,
 #else
 	memcpy(&param->postmaster_alive_fds, &postmaster_alive_fds,
 		   sizeof(postmaster_alive_fds));
+	memcpy(&param->fsync_fds, &fsync_fds, sizeof(fsync_fds));
 #endif
 
 	memcpy(&param->syslogPipe, &syslogPipe, sizeof(syslogPipe));
@@ -6292,6 +6303,7 @@ restore_backend_variables(BackendParameters *param, Port *port)
 #else
 	memcpy(&postmaster_alive_fds, &param->postmaster_alive_fds,
 		   sizeof(postmaster_alive_fds));
+	memcpy(&fsync_fds, &param->fsync_fds, sizeof(fsync_fds));
 #endif
 
 	memcpy(&syslogPipe, &param->syslogPipe, sizeof(syslogPipe));
@@ -6468,3 +6480,44 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+/* Create socket used for requesting fsyncs by checkpointer */
+static void
+InitFsyncFdSocketPair(void)
+{
+	Assert(MyProcPid == PostmasterPid);
+	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fsync_fds) < 0)
+		ereport(FATAL,
+				(errcode_for_file_access(),
+				 errmsg_internal("could not create fsync sockets: %m")));
+
+	/*
+	 * Set O_NONBLOCK on both fds.
+	 */
+	if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFL, O_NONBLOCK) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set fsync process socket to nonblocking mode: %m")));
+#ifndef EXEC_BACKEND
+	if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFD, FD_CLOEXEC) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set fsync process socket to close-on-exec mode: %m")));
+#endif
+
+	if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFL, O_NONBLOCK) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set fsync submit socket to nonblocking mode: %m")));
+#ifndef EXEC_BACKEND
+	if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFD, FD_CLOEXEC) == -1)
+		ereport(FATAL,
+				(errcode_for_socket_access(),
+				 errmsg_internal("could not set fsync submit socket to close-on-exec mode: %m")));
+#endif
+
+	/*
+	 * FIXME: do DuplicateHandle dance for windows - can that work
+	 * trivially?
+	 */
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..3dbfd0e4c06 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -85,6 +85,7 @@
 #include "catalog/pg_tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
+#include "port/atomics.h"
 #include "portability/mem.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -180,6 +181,7 @@ int			max_safe_fds = 32;	/* default if not changed */
 #define FD_DELETE_AT_CLOSE	(1 << 0)	/* T = delete when closed */
 #define FD_CLOSE_AT_EOXACT	(1 << 1)	/* T = close at eoXact */
 #define FD_TEMP_FILE_LIMIT	(1 << 2)	/* T = respect temp_file_limit */
+#define FD_NOT_IN_LRU		(1 << 3)	/* T = not in LRU */
 
 typedef struct vfd
 {
@@ -195,6 +197,7 @@ typedef struct vfd
 	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
 	int			fileFlags;		/* open(2) flags for (re)opening the file */
 	mode_t		fileMode;		/* mode to pass to open(2) */
+	uint64		open_seq;		/* sequence number of opened file */
 } Vfd;
 
 /*
@@ -304,7 +307,6 @@ static void LruDelete(File file);
 static void Insert(File file);
 static int	LruInsert(File file);
 static bool ReleaseLruFile(void);
-static void ReleaseLruFiles(void);
 static File AllocateVfd(void);
 static void FreeVfd(File file);
 
@@ -333,6 +335,13 @@ static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
 static int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
 static int	fsync_parent_path(const char *fname, int elevel);
 
+/* Shared memory state. */
+typedef struct
+{
+	pg_atomic_uint64 open_seq;
+} FdSharedData;
+
+static FdSharedData *fd_shared;
 
 /*
  * pg_fsync --- do fsync with or without writethrough
@@ -789,6 +798,20 @@ InitFileAccess(void)
 	on_proc_exit(AtProcExit_Files, 0);
 }
 
+/*
+ * Initialize shared memory state.  This is called after shared memory is
+ * ready.
+ */
+void
+FileShmemInit(void)
+{
+	bool	found;
+
+	fd_shared = ShmemInitStruct("fd_shared", sizeof(*fd_shared), &found);
+	if (!found)
+		pg_atomic_init_u64(&fd_shared->open_seq, 0);
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -1113,6 +1136,8 @@ LruInsert(File file)
 		{
 			++nfile;
 		}
+		vfdP->open_seq =
+			pg_atomic_fetch_add_u64(&fd_shared->open_seq, 1);
 
 		/*
 		 * Seek to the right position.  We need no special case for seekPos
@@ -1176,7 +1201,7 @@ ReleaseLruFile(void)
  * Release kernel FDs as needed to get under the max_safe_fds limit.
  * After calling this, it's OK to try to open another file.
  */
-static void
+void
 ReleaseLruFiles(void)
 {
 	while (nfile + numAllocatedDescs >= max_safe_fds)
@@ -1289,9 +1314,11 @@ FileAccess(File file)
 		 * We now know that the file is open and that it is not the last one
 		 * accessed, so we need to move it to the head of the Lru ring.
 		 */
-
-		Delete(file);
-		Insert(file);
+		if (!(VfdCache[file].fdstate & FD_NOT_IN_LRU))
+		{
+			Delete(file);
+			Insert(file);
+		}
 	}
 
 	return 0;
@@ -1410,6 +1437,58 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 	vfdP->fileSize = 0;
 	vfdP->fdstate = 0x0;
 	vfdP->resowner = NULL;
+	vfdP->open_seq = pg_atomic_fetch_add_u64(&fd_shared->open_seq, 1);
+
+	return file;
+}
+
+/*
+ * Open a File for a pre-existing file descriptor.
+ *
+ * Note that these files will not be closed on an LRU basis; therefore the
+ * caller is responsible for limiting the number of open file descriptors.
+ *
+ * The passed in name is purely for informational purposes.
+ */
+File
+FileOpenForFd(int fd, const char *fileName, uint64 open_seq)
+{
+	char	   *fnamecopy;
+	File		file;
+	Vfd		   *vfdP;
+
+	/*
+	 * We need a malloc'd copy of the file name; fail cleanly if no room.
+	 */
+	fnamecopy = strdup(fileName);
+	if (fnamecopy == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory")));
+
+	file = AllocateVfd();
+	vfdP = &VfdCache[file];
+
+	/* Close excess kernel FDs. */
+	ReleaseLruFiles();
+
+	vfdP->fd = fd;
+	++nfile;
+
+	DO_DB(elog(LOG, "FileOpenForFd: success %d/%d (%s)",
+			   file, fd, fnamecopy));
+
+	/* NB: Explicitly not inserted into LRU! */
+
+	vfdP->fileName = fnamecopy;
+	/* Saved flags are adjusted to be OK for re-opening file */
+	vfdP->fileFlags = 0;
+	vfdP->fileMode = 0;
+	vfdP->seekPos = 0;
+	vfdP->fileSize = 0;
+	vfdP->fdstate = FD_NOT_IN_LRU;
+	vfdP->resowner = NULL;
+	vfdP->open_seq = open_seq;
 
 	return file;
 }
@@ -1760,7 +1839,11 @@ FileClose(File file)
 		vfdP->fd = VFD_CLOSED;
 
 		/* remove the file from the lru ring */
-		Delete(file);
+		if (!(vfdP->fdstate & FD_NOT_IN_LRU))
+		{
+			vfdP->fdstate &= ~FD_NOT_IN_LRU;
+			Delete(file);
+		}
 	}
 
 	if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
@@ -2232,6 +2315,10 @@ int
 FileGetRawDesc(File file)
 {
 	Assert(FileIsValid(file));
+
+	if (FileAccess(file))
+		return -1;
+
 	return VfdCache[file].fd;
 }
 
@@ -2255,6 +2342,17 @@ FileGetRawMode(File file)
 	return VfdCache[file].fileMode;
 }
 
+/*
+ * Get the opening sequence number of this file.  This number is captured
+ * after the file was opened but before anything was written to the file,
+ */
+uint64
+FileGetOpenSeq(File file)
+{
+	Assert(FileIsValid(file));
+	return VfdCache[file].open_seq;
+}
+
 /*
  * Make room for another allocatedDescs[] array entry if needed and possible.
  * Returns true if an array element is available.
@@ -3572,3 +3670,106 @@ MakePGDirectory(const char *directoryName)
 {
 	return mkdir(directoryName, pg_dir_create_mode);
 }
+
+/*
+ * Send data over a unix domain socket, optionally (when fd >= 0) including a
+ * file descriptor.
+ */
+ssize_t
+pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd)
+{
+	ssize_t     size;
+	struct msghdr   msg = {0};
+	struct iovec    iov = {0};
+	/* cmsg header, union for correct alignment */
+	union
+	{
+		struct cmsghdr  cmsghdr;
+		char        control[CMSG_SPACE(sizeof (int))];
+	} cmsgu;
+	struct cmsghdr  *cmsg;
+
+	memset(&cmsgu, 0, sizeof(cmsgu));
+	iov.iov_base = buf;
+	iov.iov_len = buflen;
+
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
+	if (fd >= 0)
+	{
+		msg.msg_control = cmsgu.control;
+		msg.msg_controllen = sizeof(cmsgu.control);
+
+		cmsg = CMSG_FIRSTHDR(&msg);
+		cmsg->cmsg_len = CMSG_LEN(sizeof (int));
+		cmsg->cmsg_level = SOL_SOCKET;
+		cmsg->cmsg_type = SCM_RIGHTS;
+
+		*((int *) CMSG_DATA(cmsg)) = fd;
+	}
+
+	size = sendmsg(sock, &msg, 0);
+
+	/* errors are returned directly */
+	return size;
+}
+
+/*
+ * Receive data from a unix domain socket. If a file descriptor is sent over
+ * the socket, store it in *fd; otherwise *fd is set to -1.
+ */
+ssize_t
+pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd)
+{
+	ssize_t     size;
+	struct msghdr   msg;
+	struct iovec    iov;
+	/* cmsg header, union for correct alignment */
+	union
+	{
+		struct cmsghdr  cmsghdr;
+		char        control[CMSG_SPACE(sizeof (int))];
+	} cmsgu;
+	struct cmsghdr  *cmsg;
+
+	Assert(fd != NULL);
+
+	iov.iov_base = buf;
+	iov.iov_len = bufsize;
+
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = cmsgu.control;
+	msg.msg_controllen = sizeof(cmsgu.control);
+
+	size = recvmsg(sock, &msg, 0);
+
+	if (size < 0)
+	{
+		*fd = -1;
+		return size;
+	}
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	if (cmsg && cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
+	{
+		if (cmsg->cmsg_level != SOL_SOCKET)
+			elog(FATAL, "unexpected cmsg_level");
+
+		if (cmsg->cmsg_type != SCM_RIGHTS)
+			elog(FATAL, "unexpected cmsg_type");
+
+		*fd = *((int *) CMSG_DATA(cmsg));
+
+		/* FIXME: check / handle additional cmsg structures */
+	}
+	else
+		*fd = -1;
+
+	return size;
+}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 7c4ad1c4494..2b47824aab9 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -556,7 +556,7 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
 	 * not on extension.)
 	 */
 	if (rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber ||
-		blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+		rel->rd_smgr->smgr_fsm_nblocks == 0)
 	{
 		if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
 			rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
@@ -564,6 +564,9 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
 		else
 			rel->rd_smgr->smgr_fsm_nblocks = 0;
 	}
+	else if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+		rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
+													 FSM_FORKNUM);
 
 	/* Handle requests beyond EOF */
 	if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c03..704473c7c00 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -270,6 +270,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FileShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..8fc474d78fd 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -110,6 +110,7 @@ typedef struct _MdfdVec
 {
 	File		mdfd_vfd;		/* fd number in fd.c's pool */
 	BlockNumber mdfd_segno;		/* segment number, from 0 */
+	uint32		mdfd_dirtied_cycle;
 } MdfdVec;
 
 static MemoryContext MdCxt;		/* context for all MdfdVec objects */
@@ -134,16 +135,16 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
  * (Regular backends do not track pending operations locally, but forward
  * them to the checkpointer.)
  */
-typedef uint16 CycleCtr;		/* can be any convenient integer size */
+typedef uint32 CycleCtr;		/* can be any convenient integer size */
 
 typedef struct
 {
 	RelFileNode rnode;			/* hash table key (must be first!) */
-	CycleCtr	cycle_ctr;		/* mdsync_cycle_ctr of oldest request */
+	CycleCtr	cycle_ctr;		/* sync cycle of oldest request */
 	/* requests[f] has bit n set if we need to fsync segment n of fork f */
 	Bitmapset  *requests[MAX_FORKNUM + 1];
-	/* canceled[f] is true if we canceled fsyncs for fork "recently" */
-	bool		canceled[MAX_FORKNUM + 1];
+	File	   *syncfds[MAX_FORKNUM + 1];
+	int			syncfd_len[MAX_FORKNUM + 1];
 } PendingOperationEntry;
 
 typedef struct
@@ -152,11 +153,12 @@ typedef struct
 	CycleCtr	cycle_ctr;		/* mdckpt_cycle_ctr when request was made */
 } PendingUnlinkEntry;
 
+static uint32 open_fsync_queue_files = 0;
+static bool mdsync_in_progress = false;
 static HTAB *pendingOpsTable = NULL;
 static List *pendingUnlinks = NIL;
 static MemoryContext pendingOpsCxt; /* context for the above  */
 
-static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
 
@@ -197,6 +199,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
 			 BlockNumber blkno, bool skipFsync, int behavior);
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 		   MdfdVec *seg);
+static char *mdpath(RelFileNode rnode, ForkNumber forknum, BlockNumber segno);
+static void mdsyncpass(bool include_current);
 
 
 /*
@@ -334,6 +338,7 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
 	mdfd = &reln->md_seg_fds[forkNum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
+	mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
 }
 
 /*
@@ -615,6 +620,7 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
 	mdfd = &reln->md_seg_fds[forknum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
+	mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
 
 	Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
 
@@ -1048,51 +1054,36 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- *	mdsync() -- Sync previous writes to stable storage.
+ * Do one pass over the fsync request hashtable and perform the necessary
+ * fsyncs. Increments the mdsync cycle counter.
+ *
+ * If include_current is true, perform all fsyncs (this is done if too many
+ * files are open); otherwise only perform the fsyncs belonging to the cycle
+ * valid at call time.
  */
-void
-mdsync(void)
+static void
+mdsyncpass(bool include_current)
 {
-	static bool mdsync_in_progress = false;
-
 	HASH_SEQ_STATUS hstat;
 	PendingOperationEntry *entry;
 	int			absorb_counter;
 
 	/* Statistics on sync times */
-	int			processed = 0;
 	instr_time	sync_start,
 				sync_end,
 				sync_diff;
 	uint64		elapsed;
-	uint64		longest = 0;
-	uint64		total_elapsed = 0;
-
-	/*
-	 * This is only called during checkpoints, and checkpoints should only
-	 * occur in processes that have created a pendingOpsTable.
-	 */
-	if (!pendingOpsTable)
-		elog(ERROR, "cannot sync without a pendingOpsTable");
-
-	/*
-	 * If we are in the checkpointer, the sync had better include all fsync
-	 * requests that were queued by backends up to this point.  The tightest
-	 * race condition that could occur is that a buffer that must be written
-	 * and fsync'd for the checkpoint could have been dumped by a backend just
-	 * before it was visited by BufferSync().  We know the backend will have
-	 * queued an fsync request before clearing the buffer's dirtybit, so we
-	 * are safe as long as we do an Absorb after completing BufferSync().
-	 */
-	AbsorbFsyncRequests();
+	int			processed = CheckpointStats.ckpt_sync_rels;
+	uint64		longest = CheckpointStats.ckpt_longest_sync;
+	uint64		total_elapsed = CheckpointStats.ckpt_agg_sync_time;
 
 	/*
 	 * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
 	 * checkpoint), we want to ignore fsync requests that are entered into the
 	 * hashtable after this point --- they should be processed next time,
-	 * instead.  We use mdsync_cycle_ctr to tell old entries apart from new
-	 * ones: new ones will have cycle_ctr equal to the incremented value of
-	 * mdsync_cycle_ctr.
+	 * instead.  We use GetCheckpointSyncCycle() to tell old entries apart
+	 * from new ones: new ones will have cycle_ctr equal to
+	 * IncCheckpointSyncCycle().
 	 *
 	 * In normal circumstances, all entries present in the table at this point
 	 * will have cycle_ctr exactly equal to the current (about to be old)
@@ -1116,33 +1107,43 @@ mdsync(void)
 		hash_seq_init(&hstat, pendingOpsTable);
 		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 		{
-			entry->cycle_ctr = mdsync_cycle_ctr;
+			entry->cycle_ctr = GetCheckpointSyncCycle();
 		}
 	}
 
-	/* Advance counter so that new hashtable entries are distinguishable */
-	mdsync_cycle_ctr++;
-
 	/* Set flag to detect failure if we don't reach the end of the loop */
 	mdsync_in_progress = true;
 
+	/* Advance counter so that new hashtable entries are distinguishable */
+	IncCheckpointSyncCycle();
+
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber	forknum;
+		bool has_remaining;
 
 		/*
-		 * If the entry is new then don't process it this time; it might
-		 * contain multiple fsync-request bits, but they are all new.  Note
-		 * "continue" bypasses the hash-remove call at the bottom of the loop.
+		 * If we are processing fsync requests because too many file handles
+		 * are open, process entries regardless of cycle; otherwise we might
+		 * find nothing to close, and we want to make room as quickly as
+		 * possible so more requests can be absorbed.
 		 */
-		if (entry->cycle_ctr == mdsync_cycle_ctr)
-			continue;
+		if (!include_current)
+		{
+			/*
+			 * If the entry is new then don't process it this time; it might
+			 * contain multiple fsync-request bits, but they are all new.  Note
+			 * "continue" bypasses the hash-remove call at the bottom of the loop.
+			 */
+			if (entry->cycle_ctr == GetCheckpointSyncCycle())
+				continue;
 
-		/* Else assert we haven't missed it */
-		Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
+			/* Else assert we haven't missed it */
+			Assert((CycleCtr) (entry->cycle_ctr + 1) == GetCheckpointSyncCycle());
+		}
 
 		/*
 		 * Scan over the forks and segments represented by the entry.
@@ -1157,159 +1158,145 @@ mdsync(void)
 		 */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
-			Bitmapset  *requests = entry->requests[forknum];
 			int			segno;
 
-			entry->requests[forknum] = NULL;
-			entry->canceled[forknum] = false;
-
-			while ((segno = bms_first_member(requests)) >= 0)
+			segno = -1;
+			while ((segno = bms_next_member(entry->requests[forknum], segno)) >= 0)
 			{
-				int			failures;
+				int			returnCode;
 
 				/*
-				 * If fsync is off then we don't have to bother opening the
-				 * file at all.  (We delay checking until this point so that
-				 * changing fsync on the fly behaves sensibly.)
+				 * Temporarily mark as processed.  Have to do so before
+				 * absorbing further requests; otherwise we might delete a new
+				 * request belonging to a new cycle.
 				 */
-				if (!enableFsync)
-					continue;
+				bms_del_member(entry->requests[forknum], segno);
 
-				/*
-				 * If in checkpointer, we want to absorb pending requests
-				 * every so often to prevent overflow of the fsync request
-				 * queue.  It is unspecified whether newly-added entries will
-				 * be visited by hash_seq_search, but we don't care since we
-				 * don't need to process them anyway.
-				 */
-				if (--absorb_counter <= 0)
+				if (entry->syncfd_len[forknum] <= segno ||
+					entry->syncfds[forknum][segno] == -1)
 				{
-					AbsorbFsyncRequests();
-					absorb_counter = FSYNCS_PER_ABSORB;
+					/*
+					 * This is where we could open the file ourselves, if we
+					 * wanted to also support not transporting fds.
+					 */
+					elog(FATAL, "file not opened");
 				}
 
 				/*
-				 * The fsync table could contain requests to fsync segments
-				 * that have been deleted (unlinked) by the time we get to
-				 * them. Rather than just hoping an ENOENT (or EACCES on
-				 * Windows) error can be ignored, what we do on error is
-				 * absorb pending requests and then retry.  Since mdunlink()
-				 * queues a "cancel" message before actually unlinking, the
-				 * fsync request is guaranteed to be marked canceled after the
-				 * absorb if it really was this case. DROP DATABASE likewise
-				 * has to tell us to forget fsync requests before it starts
-				 * deletions.
+				 * If fsync is off then we don't have to bother opening the
+				 * file at all.  (We delay checking until this point so that
+				 * changing fsync on the fly behaves sensibly.)
+				 *
+				 * XXX: Why is that an important goal? Doesn't give any
+				 * interesting guarantees afaict?
 				 */
-				for (failures = 0;; failures++) /* loop exits at "break" */
+				if (enableFsync)
 				{
-					SMgrRelation reln;
-					MdfdVec    *seg;
-					char	   *path;
-					int			save_errno;
-
 					/*
-					 * Find or create an smgr hash entry for this relation.
-					 * This may seem a bit unclean -- md calling smgr?	But
-					 * it's really the best solution.  It ensures that the
-					 * open file reference isn't permanently leaked if we get
-					 * an error here. (You may say "but an unreferenced
-					 * SMgrRelation is still a leak!" Not really, because the
-					 * only case in which a checkpoint is done by a process
-					 * that isn't about to shut down is in the checkpointer,
-					 * and it will periodically do smgrcloseall(). This fact
-					 * justifies our not closing the reln in the success path
-					 * either, which is a good thing since in non-checkpointer
-					 * cases we couldn't safely do that.)
+					 * The fsync table could contain requests to fsync
+					 * segments that have been deleted (unlinked) by the time
+					 * we get to them.  That used to be problematic, but now
+					 * we have a filehandle to the deleted file. That means we
+					 * might fsync an empty file superfluously, in a
+					 * relatively tight window, which is acceptable.
 					 */
-					reln = smgropen(entry->rnode, InvalidBackendId);
-
-					/* Attempt to open and fsync the target segment */
-					seg = _mdfd_getseg(reln, forknum,
-									   (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
-									   false,
-									   EXTENSION_RETURN_NULL
-									   | EXTENSION_DONT_CHECK_SIZE);
 
 					INSTR_TIME_SET_CURRENT(sync_start);
 
-					if (seg != NULL &&
-						FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
-					{
-						/* Success; update statistics about sync timing */
-						INSTR_TIME_SET_CURRENT(sync_end);
-						sync_diff = sync_end;
-						INSTR_TIME_SUBTRACT(sync_diff, sync_start);
-						elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
-						if (elapsed > longest)
-							longest = elapsed;
-						total_elapsed += elapsed;
-						processed++;
-						if (log_checkpoints)
-							elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
-								 processed,
-								 FilePathName(seg->mdfd_vfd),
-								 (double) elapsed / 1000);
-
-						break;	/* out of retry loop */
-					}
+					returnCode = FileSync(entry->syncfds[forknum][segno], WAIT_EVENT_DATA_FILE_SYNC);
 
-					/* Compute file name for use in message */
-					save_errno = errno;
-					path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
-					errno = save_errno;
+					if (returnCode < 0)
+					{
+						/* XXX: decide on policy */
+						bms_add_member(entry->requests[forknum], segno);
 
-					/*
-					 * It is possible that the relation has been dropped or
-					 * truncated since the fsync request was entered.
-					 * Therefore, allow ENOENT, but only if we didn't fail
-					 * already on this file.  This applies both for
-					 * _mdfd_getseg() and for FileSync, since fd.c might have
-					 * closed the file behind our back.
-					 *
-					 * XXX is there any point in allowing more than one retry?
-					 * Don't see one at the moment, but easy to change the
-					 * test here if so.
-					 */
-					if (!FILE_POSSIBLY_DELETED(errno) ||
-						failures > 0)
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\": %m",
-										path)));
-					else
+										FilePathName(entry->syncfds[forknum][segno]))));
+					}
+
+					/* Success; update statistics about sync timing */
+					INSTR_TIME_SET_CURRENT(sync_end);
+					sync_diff = sync_end;
+					INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+					elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+					if (elapsed > longest)
+						longest = elapsed;
+					total_elapsed += elapsed;
+					processed++;
+					if (log_checkpoints)
 						ereport(DEBUG1,
-								(errcode_for_file_access(),
-								 errmsg("could not fsync file \"%s\" but retrying: %m",
-										path)));
-					pfree(path);
+								(errmsg("checkpoint sync: number=%d file=%s time=%.3f msec",
+										processed,
+										FilePathName(entry->syncfds[forknum][segno]),
+										(double) elapsed / 1000),
+								 errhidestmt(true),
+								 errhidecontext(true)));
+				}
 
+				/*
+				 * It shouldn't be possible for a new request to arrive during
+				 * the fsync (on error this will not be reached).
+				 */
+				Assert(!bms_is_member(segno, entry->requests[forknum]));
+
+				/*
+				 * Close file.  XXX: centralize code.
+				 */
+				{
+					open_fsync_queue_files--;
+					FileClose(entry->syncfds[forknum][segno]);
+					entry->syncfds[forknum][segno] = -1;
+				}
+
+				/*
+				 * If in checkpointer, we want to absorb pending requests every so
+				 * often to prevent overflow of the fsync request queue.  It is
+				 * unspecified whether newly-added entries will be visited by
+				 * hash_seq_search, but we don't care since we don't need to process
+				 * them anyway.
+				 */
+				if (absorb_counter-- <= 0)
+				{
 					/*
-					 * Absorb incoming requests and check to see if a cancel
-					 * arrived for this relation fork.
+					 * Don't absorb if too many files are open. This pass will
+					 * soon close some, so check again later.
 					 */
-					AbsorbFsyncRequests();
-					absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
-					if (entry->canceled[forknum])
-						break;
-				}				/* end retry loop */
+					if (open_fsync_queue_files < ((max_safe_fds * 7) / 10))
+						AbsorbFsyncRequests();
+					absorb_counter = FSYNCS_PER_ABSORB;
+				}
 			}
-			bms_free(requests);
 		}
 
 		/*
-		 * We've finished everything that was requested before we started to
-		 * scan the entry.  If no new requests have been inserted meanwhile,
-		 * remove the entry.  Otherwise, update its cycle counter, as all the
-		 * requests now in it must have arrived during this cycle.
+		 * We've finished everything for the file that was requested before we
+		 * started to scan the entry.  If no new requests have been inserted
+		 * meanwhile, remove the entry.  Otherwise, update its cycle counter,
+		 * as all the requests now in it must have arrived during this cycle.
+		 *
+		 * This needs to be checked separately from the above for-each-fork
+		 * loop, as new requests for this relation could have been absorbed.
 		 */
+		has_remaining = false;
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
-			if (entry->requests[forknum] != NULL)
-				break;
+			if (bms_is_empty(entry->requests[forknum]))
+			{
+				if (entry->syncfds[forknum])
+				{
+					pfree(entry->syncfds[forknum]);
+					entry->syncfds[forknum] = NULL;
+				}
+				bms_free(entry->requests[forknum]);
+				entry->requests[forknum] = NULL;
+			}
+			else
+				has_remaining = true;
 		}
-		if (forknum <= MAX_FORKNUM)
-			entry->cycle_ctr = mdsync_cycle_ctr;
+		if (has_remaining)
+			entry->cycle_ctr = GetCheckpointSyncCycle();
 		else
 		{
 			/* Okay to remove it */
@@ -1319,13 +1306,69 @@ mdsync(void)
 		}
 	}							/* end loop over hashtable entries */
 
-	/* Return sync performance metrics for report at checkpoint end */
+	/* Flag successful completion of mdsync */
+	mdsync_in_progress = false;
+
+	/* Maintain sync performance metrics for report at checkpoint end */
 	CheckpointStats.ckpt_sync_rels = processed;
 	CheckpointStats.ckpt_longest_sync = longest;
 	CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+}
 
-	/* Flag successful completion of mdsync */
-	mdsync_in_progress = false;
+/*
+ *	mdsync() -- Sync previous writes to stable storage.
+ */
+void
+mdsync(void)
+{
+	/*
+	 * This is only called during checkpoints, and checkpoints should only
+	 * occur in processes that have created a pendingOpsTable.
+	 */
+	if (!pendingOpsTable)
+		elog(ERROR, "cannot sync without a pendingOpsTable");
+
+	/*
+	 * If we are in the checkpointer, the sync had better include all fsync
+	 * requests that were queued by backends up to this point.  The tightest
+	 * race condition that could occur is that a buffer that must be written
+	 * and fsync'd for the checkpoint could have been dumped by a backend just
+	 * before it was visited by BufferSync().  We know the backend will have
+	 * queued an fsync request before clearing the buffer's dirtybit, so we
+	 * are safe as long as we do an Absorb after completing BufferSync().
+	 */
+	AbsorbAllFsyncRequests();
+
+	mdsyncpass(false);
+}
+
+/*
+ * Flush the fsync request queue enough to make sure there's room for at least
+ * one more entry.
+ */
+bool
+FlushFsyncRequestQueueIfNecessary(void)
+{
+	if (mdsync_in_progress)
+		return false;
+
+	while (true)
+	{
+		if (open_fsync_queue_files >= ((max_safe_fds * 7) / 10))
+		{
+			elog(DEBUG1,
+				 "flush fsync request queue due to %u open files",
+				 open_fsync_queue_files);
+			mdsyncpass(true);
+			elog(DEBUG1,
+				 "flushed fsync request, now at %u open files",
+				 open_fsync_queue_files);
+		}
+		else
+			break;
+	}
+
+	return true;
 }
 
 /*
@@ -1410,12 +1453,38 @@ mdpostckpt(void)
 		 */
 		if (--absorb_counter <= 0)
 		{
-			AbsorbFsyncRequests();
+			/* XXX: Centralize this condition */
+			if (open_fsync_queue_files < ((max_safe_fds * 7) / 10))
+				AbsorbFsyncRequests();
 			absorb_counter = UNLINKS_PER_ABSORB;
 		}
 	}
 }
 
+
+/*
+ * Return the filename for the specified segment of the relation. The
+ * returned string is palloc'd.
+ */
+static char *
+mdpath(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+{
+	char	   *path,
+			   *fullpath;
+
+	path = relpathperm(rnode, forknum);
+
+	if (segno > 0)
+	{
+		fullpath = psprintf("%s.%u", path, segno);
+		pfree(path);
+	}
+	else
+		fullpath = path;
+
+	return fullpath;
+}
+
 /*
  * register_dirty_segment() -- Mark a relation segment as needing fsync
  *
@@ -1428,28 +1497,53 @@ mdpostckpt(void)
 static void
 register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
+	uint32 cycle;
+
 	/* Temp relations should never be fsync'd */
 	Assert(!SmgrIsTemp(reln));
 
+	pg_memory_barrier();
+	cycle = GetCheckpointSyncCycle();
+
+	/*
+	 * For historical reasons the checkpointer keeps track of the number of
+	 * times backends perform writes themselves.
+	 */
+	if (!AmBackgroundWriterProcess())
+		CountBackendWrite();
+
+	/*
+	 * Don't repeatedly register the same segment as dirty.
+	 *
+	 * FIXME: This doesn't correctly deal with overflows yet! We could
+	 * e.g. emit an smgr invalidation every now and then, or use a 64bit
+	 * counter.  Or just error out if the cycle reaches UINT32_MAX.
+	 */
+	if (seg->mdfd_dirtied_cycle == cycle)
+		return;
+
 	if (pendingOpsTable)
 	{
-		/* push it into local pending-ops table */
-		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+		int fd;
+
+		/*
+		 * Push it into local pending-ops table.
+		 *
+		 * Gotta duplicate the fd - we can't have fd.c close it behind our
+		 * back, as that'd lead to losing error-reporting guarantees on
+		 * Linux.  RememberFsyncRequest() will manage the lifetime.
+		 */
+		ReleaseLruFiles();
+		fd = dup(FileGetRawDesc(seg->mdfd_vfd));
+		if (fd < 0)
+			elog(ERROR, "could not duplicate file descriptor: %m");
+		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno, fd,
+							 FileGetOpenSeq(seg->mdfd_vfd));
 	}
 	else
-	{
-		if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
-			return;				/* passed it off successfully */
+		ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno, seg->mdfd_vfd);
 
-		ereport(DEBUG1,
-				(errmsg("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
-	}
+	seg->mdfd_dirtied_cycle = cycle;
 }
 
 /*
@@ -1471,21 +1565,14 @@ register_unlink(RelFileNodeBackend rnode)
 	{
 		/* push it into local pending-ops table */
 		RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
-							 UNLINK_RELATION_REQUEST);
+							 UNLINK_RELATION_REQUEST,
+							 -1, 0);
 	}
 	else
 	{
-		/*
-		 * Notify the checkpointer about it.  If we fail to queue the request
-		 * message, we have to sleep and try again, because we can't simply
-		 * delete the file now.  Ugly, but hopefully won't happen often.
-		 *
-		 * XXX should we just leave the file orphaned instead?
-		 */
+		/* Notify the checkpointer about it. */
 		Assert(IsUnderPostmaster);
-		while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
-									UNLINK_RELATION_REQUEST))
-			pg_usleep(10000L);	/* 10 msec seems a good number */
+		ForwardFsyncRequest(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST, -1);
 	}
 }
 
@@ -1511,7 +1598,8 @@ register_unlink(RelFileNodeBackend rnode)
  * heavyweight operation anyhow, so we'll live with it.)
  */
 void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno,
+					 int fd, uint64 open_seq)
 {
 	Assert(pendingOpsTable);
 
@@ -1529,18 +1617,28 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 			/*
 			 * We can't just delete the entry since mdsync could have an
 			 * active hashtable scan.  Instead we delete the bitmapsets; this
-			 * is safe because of the way mdsync is coded.  We also set the
-			 * "canceled" flags so that mdsync can tell that a cancel arrived
-			 * for the fork(s).
+			 * is safe because of the way mdsync is coded.
 			 */
 			if (forknum == InvalidForkNumber)
 			{
 				/* remove requests for all forks */
 				for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 				{
+					int segno;
+
 					bms_free(entry->requests[forknum]);
 					entry->requests[forknum] = NULL;
-					entry->canceled[forknum] = true;
+
+					for (segno = 0; segno < entry->syncfd_len[forknum]; segno++)
+					{
+						if (entry->syncfds[forknum][segno] != -1)
+						{
+							open_fsync_queue_files--;
+							FileClose(entry->syncfds[forknum][segno]);
+							entry->syncfds[forknum][segno] = -1;
+						}
+					}
+
 				}
 			}
 			else
@@ -1548,7 +1646,16 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 				/* remove requests for single fork */
 				bms_free(entry->requests[forknum]);
 				entry->requests[forknum] = NULL;
-				entry->canceled[forknum] = true;
+
+				for (segno = 0; segno < entry->syncfd_len[forknum]; segno++)
+				{
+					if (entry->syncfds[forknum][segno] != -1)
+					{
+						open_fsync_queue_files--;
+						FileClose(entry->syncfds[forknum][segno]);
+						entry->syncfds[forknum][segno] = -1;
+					}
+				}
 			}
 		}
 	}
@@ -1572,7 +1679,6 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 				{
 					bms_free(entry->requests[forknum]);
 					entry->requests[forknum] = NULL;
-					entry->canceled[forknum] = true;
 				}
 			}
 		}
@@ -1624,9 +1730,10 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		/* if new entry, initialize it */
 		if (!found)
 		{
-			entry->cycle_ctr = mdsync_cycle_ctr;
+			entry->cycle_ctr = GetCheckpointSyncCycle();
 			MemSet(entry->requests, 0, sizeof(entry->requests));
-			MemSet(entry->canceled, 0, sizeof(entry->canceled));
+			MemSet(entry->syncfds, 0, sizeof(entry->syncfds));
+			MemSet(entry->syncfd_len, 0, sizeof(entry->syncfd_len));
 		}
 
 		/*
@@ -1638,6 +1745,69 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		entry->requests[forknum] = bms_add_member(entry->requests[forknum],
 												  (int) segno);
 
+		if (fd >= 0)
+		{
+			File existing_file;
+			File new_file;
+
+			/* make space for entry */
+			if (entry->syncfds[forknum] == NULL)
+			{
+				int i;
+
+				entry->syncfds[forknum] = palloc(sizeof(File) * (segno + 1));
+				entry->syncfd_len[forknum] = segno + 1;
+
+				for (i = 0; i <= segno; i++)
+					entry->syncfds[forknum][i] = -1;
+			}
+			else if (entry->syncfd_len[forknum] <= segno)
+			{
+				int i;
+
+				entry->syncfds[forknum] = repalloc(entry->syncfds[forknum],
+												   sizeof(File) * (segno + 1));
+
+				/* initialize newly created entries */
+				for (i = entry->syncfd_len[forknum]; i <= segno; i++)
+					entry->syncfds[forknum][i] = -1;
+
+				entry->syncfd_len[forknum] = segno + 1;
+			}
+
+			/*
+			 * If we didn't have a file already, or we did have a file but it
+			 * was opened later than this one, we'll keep the newly arrived
+			 * one.
+			 */
+			existing_file = entry->syncfds[forknum][segno];
+			if (existing_file == -1 || FileGetOpenSeq(existing_file) > open_seq)
+			{
+				char *path = mdpath(entry->rnode, forknum, segno);
+
+				open_fsync_queue_files++;
+				new_file = FileOpenForFd(fd, path, open_seq);
+				/* caller must have reserved entry */
+				entry->syncfds[forknum][segno] = new_file;
+				pfree(path);
+				if (existing_file != -1)
+					FileClose(existing_file);
+			}
+			else
+			{
+				/*
+				 * File is already open. Have to keep the older fd, errors
+				 * might only be reported to it, thus close the one we just
+				 * got.
+				 *
+				 * XXX: check for errors.
+				 */
+				close(fd);
+			}
+
+			FlushFsyncRequestQueueIfNecessary();
+		}
+
 		MemoryContextSwitchTo(oldcxt);
 	}
 }
@@ -1654,22 +1824,12 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC, -1, 0);
 	}
 	else if (IsUnderPostmaster)
 	{
-		/*
-		 * Notify the checkpointer about it.  If we fail to queue the cancel
-		 * message, we have to sleep and try again ... ugly, but hopefully
-		 * won't happen often.
-		 *
-		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
-		 * error would leave the no-longer-used file still present on disk,
-		 * which would be bad, so I'm inclined to assume that the checkpointer
-		 * will always empty the queue soon.
-		 */
-		while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
-			pg_usleep(10000L);	/* 10 msec seems a good number */
+		/* Notify the checkpointer about it. */
+		ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC, -1);
 
 		/*
 		 * Note we don't wait for the checkpointer to actually absorb the
@@ -1693,14 +1853,12 @@ ForgetDatabaseFsyncRequests(Oid dbid)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
+		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC, -1, 0);
 	}
 	else if (IsUnderPostmaster)
 	{
 		/* see notes in ForgetRelationFsyncRequests */
-		while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
-									FORGET_DATABASE_FSYNC))
-			pg_usleep(10000L);	/* 10 msec seems a good number */
+		ForwardFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC, -1);
 	}
 }
 
@@ -1831,6 +1989,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	v = &reln->md_seg_fds[forknum][segno];
 	v->mdfd_vfd = fd;
 	v->mdfd_segno = segno;
+	v->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
 
 	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d1..58ba671a907 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
 #define _BGWRITER_H
 
 #include "storage/block.h"
+#include "storage/fd.h"
 #include "storage/relfilenode.h"
 
 
@@ -31,13 +32,19 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					BlockNumber segno);
+extern void ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
+								BlockNumber segno, File file);
 extern void AbsorbFsyncRequests(void);
+extern void AbsorbAllFsyncRequests(void);
 
 extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
 
+extern uint32 GetCheckpointSyncCycle(void);
+extern uint32 IncCheckpointSyncCycle(void);
+
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void CountBackendWrite(void);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 1877eef2391..e2ba64e8984 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -44,6 +44,11 @@ extern int	postmaster_alive_fds[2];
 #define POSTMASTER_FD_OWN		1	/* kept open by postmaster only */
 #endif
 
+#define FSYNC_FD_SUBMIT			0
+#define FSYNC_FD_PROCESS		1
+
+extern int	fsync_fds[2];
+
 extern PGDLLIMPORT const char *progname;
 
 extern void PostmasterMain(int argc, char *argv[]) pg_attribute_noreturn();
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..2808b06613a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -65,6 +65,7 @@ extern int	max_safe_fds;
 /* Operations on virtual Files --- equivalent to Unix kernel file ops */
 extern File PathNameOpenFile(const char *fileName, int fileFlags);
 extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
+extern File FileOpenForFd(int fd, const char *fileName, uint64 open_seq);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
@@ -78,6 +79,8 @@ extern char *FilePathName(File file);
 extern int	FileGetRawDesc(File file);
 extern int	FileGetRawFlags(File file);
 extern mode_t FileGetRawMode(File file);
+extern uint64 FileGetOpenSeq(File file);
+extern void FileSetOpenSeq(File file, uint64 seq);
 
 /* Operations used for sharing named temporary files */
 extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
@@ -116,6 +119,7 @@ extern int	MakePGDirectory(const char *directoryName);
 
 /* Miscellaneous support routines */
 extern void InitFileAccess(void);
+extern void FileShmemInit(void);
 extern void set_max_safe_fds(void);
 extern void closeAllVfds(void);
 extern void SetTempTablespaces(Oid *tableSpaces, int numSpaces);
@@ -127,6 +131,7 @@ extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
 				  SubTransactionId parentSubid);
 extern void RemovePgTempFiles(void);
 extern bool looks_like_temp_rel_name(const char *name);
+extern void ReleaseLruFiles(void);
 
 extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
@@ -143,4 +148,8 @@ extern void SyncDataDirectory(void);
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
 #define PG_TEMP_FILE_PREFIX "pgsql_tmp"
 
+/* XXX; This should probably go elsewhere */
+ssize_t pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd);
+ssize_t pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd);
+
 #endif							/* FD_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index c843bbc9692..0f8016956f0 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -140,7 +140,8 @@ extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);
 extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					 BlockNumber segno);
+								 BlockNumber segno, int fd, uint64 open_seq);
+extern bool FlushFsyncRequestQueueIfNecessary(void);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
 extern void ForgetDatabaseFsyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
-- 
2.17.1 (Apple Git-112)

0002-Add-an-fsync-request-pipe-for-Windows-v4.patch (application/octet-stream)
From 2d976f6b8ea424b3903fa71027b0abc19ce6e7c6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 17 Aug 2018 16:00:39 +1200
Subject: [PATCH 2/3] Add an fsync request pipe for Windows.

On Windows, a pipe is the most natural replacement for a Unix domain
socket, but unfortunately pipes don't support multiplexing via
WSAEventSelect(), as used by our WaitEventSet machinery.  So use
"overlapped" IO, and add the ability to wait for IO completion to
WaitEventSet.  A new wait event flag WL_WIN32_HANDLE is provided
on Windows only, and used to wait for asynchronous read and write
operations over the checkpointer pipe.

XXX Could use serious review from a real Windows programmer.

Author: Thomas Munro
---
 src/backend/postmaster/checkpointer.c | 167 ++++++++++++++++++++++++--
 src/backend/postmaster/postmaster.c   |  80 ++++++++++--
 src/backend/storage/file/fd.c         |   4 +
 src/backend/storage/ipc/latch.c       |  12 ++
 src/include/postmaster/postmaster.h   |   4 +
 src/include/storage/fd.h              |   2 +
 src/include/storage/latch.h           |   1 +
 7 files changed, 250 insertions(+), 20 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 645a5a59e0c..4aa889f49ef 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -188,6 +188,11 @@ static void ReqCheckpointHandler(SIGNAL_ARGS);
 static void chkpt_sigusr1_handler(SIGNAL_ARGS);
 static void ReqShutdownHandler(SIGNAL_ARGS);
 
+#ifdef WIN32
+/* State used to track in-progress asynchronous fsync pipe reads. */
+static OVERLAPPED absorb_overlapped;
+static HANDLE *absorb_read_in_progress;
+#endif
 
 /*
  * Main entry point for checkpointer process
@@ -200,6 +205,7 @@ CheckpointerMain(void)
 {
 	sigjmp_buf	local_sigjmp_buf;
 	MemoryContext checkpointer_context;
+	WaitEventSet *wes;
 
 	CheckpointerShmem->checkpointer_pid = MyProcPid;
 
@@ -340,6 +346,21 @@ CheckpointerMain(void)
 	 */
 	ProcGlobal->checkpointerLatch = &MyProc->procLatch;
 
+	/* Create reusable WaitEventSet. */
+	wes = CreateWaitEventSet(TopMemoryContext, 3);
+	AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL,
+					  NULL);
+	AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+#ifndef WIN32
+	AddWaitEventToSet(wes, WL_SOCKET_READABLE, fsync_fds[FSYNC_FD_PROCESS],
+					  NULL, NULL);
+#else
+	absorb_overlapped.hEvent = CreateEvent(NULL, TRUE, TRUE,
+										   "fsync pipe read completion");
+	AddWaitEventToSet(wes, WL_WIN32_HANDLE, PGINVALID_SOCKET, NULL,
+					  &absorb_overlapped.hEvent);
+#endif
+
 	/*
 	 * Loop forever
 	 */
@@ -351,6 +372,7 @@ CheckpointerMain(void)
 		int			elapsed_secs;
 		int			cur_timeout;
 		int			rc;
+		WaitEvent	event;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -551,17 +573,14 @@ CheckpointerMain(void)
 			cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
 		}
 
-		rc = WaitLatchOrSocket(MyLatch,
-							   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-							   fsync_fds[FSYNC_FD_PROCESS],
-							   cur_timeout * 1000L /* convert to ms */ ,
-							   WAIT_EVENT_CHECKPOINTER_MAIN);
+		rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
+		Assert(rc > 0);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
 		 * necessity for manual cleanup of all postmaster children.
 		 */
-		if (rc & WL_POSTMASTER_DEATH)
+		if (event.events == WL_POSTMASTER_DEATH)
 			exit(1);
 	}
 }
@@ -1126,7 +1145,18 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno,
 	request.rnode = rnode;
 	request.forknum = forknum;
 	request.segno = segno;
+#ifndef WIN32
 	request.contains_fd = file != -1;
+#else
+	/*
+	 * For now we don't try to send duplicate handles to the checkpointer on
+	 * Windows.  That would be possible, but it's not clear whether it would
+	 * actually serve any useful purpose in that kernel without inside
+	 * knowledge of how it tracks errors.  The file will simply be reopened by
+	 * name when required by the checkpointer.
+	 */
+	request.contains_fd = false;
+#endif
 
 	/*
 	 * Tell the checkpointer the sequence number of the most recent open, so
@@ -1215,13 +1245,18 @@ AbsorbAllFsyncRequests(void)
 static bool
 AbsorbFsyncRequest(bool stop_at_current_cycle)
 {
-	CheckpointerRequest req;
-	int fd;
+	static CheckpointerRequest req;
+	int fd = -1;
+#ifndef WIN32
 	int ret;
+#else
+	DWORD bytes_read;
+#endif
 
 	ReleaseLruFiles();
 
 	START_CRIT_SECTION();
+#ifndef WIN32
 	ret = pg_uds_recv_with_fd(fsync_fds[FSYNC_FD_PROCESS], &req, sizeof(req), &fd);
 	if (ret < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
@@ -1230,6 +1265,51 @@ AbsorbFsyncRequest(bool stop_at_current_cycle)
 	}
 	else if (ret < 0)
 		elog(ERROR, "recvmsg failed: %m");
+#else
+	if (!absorb_read_in_progress)
+	{
+		if (!ReadFile(fsyncPipe[FSYNC_FD_PROCESS], &req, sizeof(req), &bytes_read,
+					  &absorb_overlapped))
+		{
+			if (GetLastError() != ERROR_IO_PENDING)
+			{
+				_dosmaperr(GetLastError());
+				elog(ERROR, "can't begin read from fsync pipe: %m");
+			}
+
+			/*
+			 * An asynchronous read has begun.  We'll tell caller to call us
+			 * back when the event indicates completion.
+			 */
+			absorb_read_in_progress = &absorb_overlapped.hEvent;
+			END_CRIT_SECTION();
+			return false;
+		}
+		/* The read completed synchronously.  'req' is now populated. */
+	}
+	if (absorb_read_in_progress)
+	{
+		/* Completed yet? */
+		if (!GetOverlappedResult(fsyncPipe[FSYNC_FD_PROCESS], &absorb_overlapped, &bytes_read,
+								 false))
+		{
+			if (GetLastError() == ERROR_IO_INCOMPLETE)
+			{
+				/* Nope.  Spurious event?  Tell caller to wait some more. */
+				END_CRIT_SECTION();
+				return false;
+			}
+			_dosmaperr(GetLastError());
+			elog(ERROR, "can't complete read from fsync pipe: %m");
+		}
+		/* The asynchronous read completed.  'req' is now populated. */
+		absorb_read_in_progress = NULL;
+	}
+
+	/* Check message size. */
+	if (bytes_read != sizeof(req))
+		elog(ERROR, "unexpected short read on fsync pipe");
+#endif
 
 	if (req.contains_fd != (fd != -1))
 	{
@@ -1305,16 +1385,25 @@ CountBackendWrite(void)
 	pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_writes, 1);
 }
 
+/*
+ * Send a message to the checkpointer's fsync socket (Unix) or pipe (Windows).
+ * This is essentially a blocking call (there is no CHECK_FOR_INTERRUPTS, and
+ * even if there were it'd be suppressed since callers hold a lock), except
+ * that we don't ignore postmaster death so we need an event loop.
+ *
+ * The code is rather different on Windows, because there we have to do the
+ * write and then wait for it to complete, while on Unix we have to wait until
+ * we can do the write.
+ */
 static void
 SendFsyncRequest(CheckpointerRequest *request, int fd)
 {
+#ifndef WIN32
 	ssize_t ret;
 	int		rc;
 
 	while (true)
 	{
-		CHECK_FOR_INTERRUPTS();
-
 		ret = pg_uds_send_with_fd(fsync_fds[FSYNC_FD_SUBMIT], request, sizeof(*request),
 								  request->contains_fd ? fd : -1);
 
@@ -1341,7 +1430,7 @@ SendFsyncRequest(CheckpointerRequest *request, int fd)
 			 * only for that OS.
 			 */
 
-			/* blocked on write - wait for socket to become readable */
+			/* Blocked on write - wait for socket to become writeable */
 			rc = WaitLatchOrSocket(NULL,
 								   WL_SOCKET_WRITEABLE | WL_POSTMASTER_DEATH,
 								   fsync_fds[FSYNC_FD_SUBMIT], -1, 0);
@@ -1351,4 +1440,60 @@ SendFsyncRequest(CheckpointerRequest *request, int fd)
 		else
 			ereport(FATAL, (errmsg("could not send fsync request: %m")));
 	}
+
+#else /* WIN32 */
+	{
+		OVERLAPPED overlapped = {0};
+		DWORD nwritten;
+		int rc;
+
+		overlapped.hEvent = CreateEvent(NULL, TRUE, TRUE, NULL);
+
+		if (!WriteFile(fsyncPipe[FSYNC_FD_SUBMIT], request, sizeof(*request), &nwritten,
+					   &overlapped))
+		{
+			WaitEventSet *wes;
+			WaitEvent event;
+
+			/* Handle unexpected errors. */
+			if (GetLastError() != ERROR_IO_PENDING)
+			{
+				_dosmaperr(GetLastError());
+				CloseHandle(overlapped.hEvent);
+				ereport(FATAL, (errmsg("could not send fsync request: %m")));
+			}
+
+			/* Wait for asynchronous IO to complete. */
+			wes = CreateWaitEventSet(TopMemoryContext, 3);
+			AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL,
+							  NULL);
+			AddWaitEventToSet(wes, WL_WIN32_HANDLE, PGINVALID_SOCKET, NULL,
+							  &overlapped.hEvent);
+			for (;;)
+			{
+				rc = WaitEventSetWait(wes, -1, &event, 1, 0);
+				Assert(rc > 0);
+				if (event.events == WL_POSTMASTER_DEATH)
+					exit(1);
+				if (event.events == WL_WIN32_HANDLE)
+				{
+					if (!GetOverlappedResult(fsyncPipe[FSYNC_FD_SUBMIT], &overlapped,
+											 &nwritten, FALSE))
+					{
+						_dosmaperr(GetLastError());
+						CloseHandle(overlapped.hEvent);
+						ereport(FATAL, (errmsg("could not get result of sending fsync request: %m")));
+					}
+					if (nwritten > 0)
+						break;
+				}
+			}
+			FreeWaitEventSet(wes);
+		}
+
+		CloseHandle(overlapped.hEvent);
+		if (nwritten != sizeof(*request))
+			elog(FATAL, "unexpected short write to fsync request pipe");
+	}
+#endif
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5631a09fcb8..8ec71d13fa7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -525,6 +525,7 @@ typedef struct
 	HANDLE		PostmasterHandle;
 	HANDLE		initial_signal_pipe;
 	HANDLE		syslogPipe[2];
+	HANDLE		fsyncPipe[2];
 #else
 	int			postmaster_alive_fds[2];
 	int			syslogPipe[2];
@@ -571,7 +572,11 @@ int			postmaster_alive_fds[2] = {-1, -1};
 HANDLE		PostmasterHandle;
 #endif
 
+#ifndef WIN32
 int			fsync_fds[2] = {-1, -1};
+#else
+HANDLE		fsyncPipe[2] = {0, 0};
+#endif
 
 /*
  * Postmaster main entry point
@@ -6004,7 +6009,8 @@ extern pg_time_t first_syslogger_file_time;
 #define write_inheritable_socket(dest, src, childpid) ((*(dest) = (src)), true)
 #define read_inheritable_socket(dest, src) (*(dest) = *(src))
 #else
-static bool write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE child);
+static bool write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE child,
+									bool close_source);
 static bool write_inheritable_socket(InheritableSocket *dest, SOCKET src,
 						 pid_t childPid);
 static void read_inheritable_socket(SOCKET *dest, InheritableSocket *src);
@@ -6068,7 +6074,15 @@ save_backend_variables(BackendParameters *param, Port *port,
 	param->PostmasterHandle = PostmasterHandle;
 	if (!write_duplicated_handle(&param->initial_signal_pipe,
 								 pgwin32_create_signal_listener(childPid),
-								 childProcess))
+								 childProcess, true))
+		return false;
+	if (!write_duplicated_handle(&param->fsyncPipe[0],
+								 fsyncPipe[0],
+								 childProcess, false))
+		return false;
+	if (!write_duplicated_handle(&param->fsyncPipe[1],
+								 fsyncPipe[1],
+								 childProcess, false))
 		return false;
 #else
 	memcpy(&param->postmaster_alive_fds, &postmaster_alive_fds,
@@ -6094,7 +6108,8 @@ save_backend_variables(BackendParameters *param, Port *port,
  * process instance of the handle to the parameter file.
  */
 static bool
-write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess)
+write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess,
+						bool close_source)
 {
 	HANDLE		hChild = INVALID_HANDLE_VALUE;
 
@@ -6104,7 +6119,8 @@ write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess)
 						 &hChild,
 						 0,
 						 TRUE,
-						 DUPLICATE_CLOSE_SOURCE | DUPLICATE_SAME_ACCESS))
+						 (close_source ? DUPLICATE_CLOSE_SOURCE : 0) |
+						 DUPLICATE_SAME_ACCESS))
 	{
 		ereport(LOG,
 				(errmsg_internal("could not duplicate handle to be written to backend parameter file: error code %lu",
@@ -6300,6 +6316,8 @@ restore_backend_variables(BackendParameters *param, Port *port)
 #ifdef WIN32
 	PostmasterHandle = param->PostmasterHandle;
 	pgwin32_initial_signal_pipe = param->initial_signal_pipe;
+	fsyncPipe[0] = param->fsyncPipe[0];
+	fsyncPipe[1] = param->fsyncPipe[1];
 #else
 	memcpy(&postmaster_alive_fds, &param->postmaster_alive_fds,
 		   sizeof(postmaster_alive_fds));
@@ -6486,11 +6504,12 @@ static void
 InitFsyncFdSocketPair(void)
 {
 	Assert(MyProcPid == PostmasterPid);
+
+#ifndef WIN32
 	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fsync_fds) < 0)
 		ereport(FATAL,
 				(errcode_for_file_access(),
 				 errmsg_internal("could not create fsync sockets: %m")));
-
 	/*
 	 * Set O_NONBLOCK on both fds.
 	 */
@@ -6515,9 +6534,52 @@ InitFsyncFdSocketPair(void)
 				(errcode_for_socket_access(),
 				 errmsg_internal("could not set fsync submit socket to close-on-exec mode: %m")));
 #endif
+#else
+	{
+		char		pipename[MAX_PATH];
+		SECURITY_ATTRIBUTES sa;
 
-	/*
-	 * FIXME: do DuplicateHandle dance for windows - can that work
-	 * trivially?
-	 */
+		memset(&sa, 0, sizeof(sa));
+
+		/*
+		 * We'll create a named pipe, because anonymous pipes don't allow
+		 * overlapped (= async) IO or message-oriented communication.  We'll
+		 * open both ends of it here, and then duplicate them into all child
+		 * processes in save_backend_variables().  First, open the server end.
+		 */
+		snprintf(pipename, sizeof(pipename), "\\\\.\\Pipe\\fsync_pipe.%08x",
+				 GetCurrentProcessId());
+		fsyncPipe[FSYNC_FD_PROCESS] = CreateNamedPipeA(pipename,
+													   PIPE_ACCESS_INBOUND | FILE_FLAG_OVERLAPPED,
+													   PIPE_TYPE_MESSAGE | PIPE_WAIT,
+													   1,
+													   4096,
+													   4096,
+													   -1,
+													   &sa);
+		if (!fsyncPipe[FSYNC_FD_PROCESS])
+		{
+			_dosmaperr(GetLastError());
+			ereport(FATAL,
+					(errcode_for_file_access(),
+					 errmsg_internal("could not create server end of fsync pipe: %m")));
+		}
+
+		/* Now open the client end. */
+		fsyncPipe[FSYNC_FD_SUBMIT] = CreateFileA(pipename,
+												 GENERIC_WRITE,
+												 0,
+												 &sa,
+												 OPEN_EXISTING,
+												 FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED,
+												 NULL);
+		if (!fsyncPipe[FSYNC_FD_SUBMIT])
+		{
+			_dosmaperr(GetLastError());
+			ereport(FATAL,
+					(errcode_for_file_access(),
+					 errmsg_internal("could not create client end of fsync pipe: %m")));
+		}
+	}
+#endif
 }
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 3dbfd0e4c06..d5c8328b5d6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3671,6 +3671,8 @@ MakePGDirectory(const char *directoryName)
 	return mkdir(directoryName, pg_dir_create_mode);
 }
 
+#ifndef WIN32
+
 /*
  * Send data over a unix domain socket, optionally (when fd != -1) including a
  * file descriptor.
@@ -3773,3 +3775,5 @@ pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd)
 
 	return size;
 }
+
+#endif
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f6dda9cc9ac..081d399eefc 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -878,6 +878,12 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
 	{
 		*handle = PostmasterHandle;
 	}
+#ifdef WIN32
+	else if (event->events == WL_WIN32_HANDLE)
+	{
+		*handle = *(HANDLE *)event->user_data;
+	}
+#endif
 	else
 	{
 		int			flags = FD_CLOSE;	/* always check for errors/EOF */
@@ -1453,6 +1459,12 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			returned_events++;
 		}
 	}
+	else if (cur_event->events & WL_WIN32_HANDLE)
+	{
+		occurred_events->events |= WL_WIN32_HANDLE;
+		occurred_events++;
+		returned_events++;
+	}
 
 	return returned_events;
 }
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index e2ba64e8984..821fd2d1ad2 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -47,7 +47,11 @@ extern int	postmaster_alive_fds[2];
 #define FSYNC_FD_SUBMIT			0
 #define FSYNC_FD_PROCESS		1
 
+#ifndef WIN32
 extern int	fsync_fds[2];
+#else
+extern HANDLE fsyncPipe[2];
+#endif
 
 extern PGDLLIMPORT const char *progname;
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2808b06613a..d952acf714e 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -148,8 +148,10 @@ extern void SyncDataDirectory(void);
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
 #define PG_TEMP_FILE_PREFIX "pgsql_tmp"
 
+#ifndef WIN32
 /* XXX; This should probably go elsewhere */
 ssize_t pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd);
 ssize_t pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd);
+#endif
 
 #endif							/* FD_H */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index fd8735b7f5f..a74eedfe4e9 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -128,6 +128,7 @@ typedef struct Latch
 #define WL_POSTMASTER_DEATH  (1 << 4)
 #ifdef WIN32
 #define WL_SOCKET_CONNECTED  (1 << 5)
+#define WL_WIN32_HANDLE		 (1 << 6)
 #else
 /* avoid having to deal with case on platforms not requiring it */
 #define WL_SOCKET_CONNECTED  WL_SOCKET_WRITEABLE
-- 
2.17.1 (Apple Git-112)

#64Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#63)
2 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

Hello hackers,

Let's try to get this issue resolved. Here is my position on the
course of action we should take in back-branches:

1. I am -1 on back-patching the fd-transfer code. It's a significant
change, and even when sufficiently debugged (I don't think it's there
yet), we have no idea what will happen on all the kernels we support
under extreme workloads. IMHO there is no way we can spring this on
users in a point release.

2. I am +1 on back-patching Craig's PANIC-on-failure logic. Doing
nothing is not an option I like. I have some feedback and changes to
propose though; see attached.

Responses to a review from Robert:

On Thu, Jul 19, 2018 at 7:23 AM Robert Haas <robertmhaas@gmail.com> wrote:

2. I don't like promote_ioerr_to_panic() very much, partly because the
same pattern gets repeated over and over, and partly because it would
be awkwardly-named if we discovered that another 2 or 3 errors needed
similar handling (or some other variant handling). I suggest instead
having a function like report_critical_fsync_failure(char *path) that
does something like this:

    int elevel = ERROR;

    if (errno == EIO)
        elevel = PANIC;
    ereport(elevel,
            (errcode_for_file_access(),
             errmsg("could not fsync file \"%s\": %m", path)));

And similarly I'd add report_critical_close_failure. In some cases,
this would remove wording variations (e.g. in twophase.c) but I think
that's fine, and maybe an improvement, as discussed on another recent
thread.

I changed it to look like data_sync_elevel(ERROR) and made it treat
all errnos the same. ENOSPC, EIO, EWOK, EIEIO, it makes no difference
to the level of faith I have that my data still exists.
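The shape of data_sync_elevel() implied by this description is roughly the following sketch (the ERROR/PANIC values here are placeholders standing in for PostgreSQL's elog.h constants, and data_sync_retry is the GUC proposed later in this message):

```c
#include <stdbool.h>

/* Placeholder elevel values; in PostgreSQL these come from elog.h. */
#define ERROR 21
#define PANIC 23

/* GUC proposed below; defaults to false. */
static bool data_sync_retry = false;

/*
 * Promote any data-sync error to PANIC, regardless of errno, unless the
 * user has explicitly opted back in to the old retry behaviour.
 */
static int
data_sync_elevel(int elevel)
{
	return data_sync_retry ? elevel : PANIC;
}
```

Callers then write ereport(data_sync_elevel(ERROR), ...) at every fsync()/close() site that matters for durability.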

3. slru.c calls pg_fsync() but isn't changed by the patch. That looks wrong.

Fixed.

4. The comment changes in snapbuild.c interact with the TODO that
immediately follows. I think more adjustment is needed here.

I don't understand this.

5. It seems odd that you adjusted the comment for
pg_fsync_no_writethrough() but not pg_fsync_writethrough() or
pg_fsync(). Either pg_fsync_writethrough() doesn't have the same
problem, in which case, awesome, but let's add a comment, or it does,
in which case it should refer to the other one. And I think
pg_fsync() itself needs a comment saying that every caller must be
careful to use promote_ioerr_to_panic() or
report_critical_fsync_failure() or whatever we end up calling it
unless the fsync is not critical for data integrity.

I removed these comments and many others; I don't see the point in
scattering descriptions of this problem and references to specific
versions of Linux and -hackers archive links all over the place. I
added a comment in one place, and also added some user documentation
of the problem.

6. In md.c, there's a stray blank line added. But more importantly,
the code just above that looks like this:

                    if (!FILE_POSSIBLY_DELETED(errno) ||
                        failures > 0)
-                        ereport(ERROR,
+                        ereport(promote_ioerr_to_panic(ERROR),
                                (errcode_for_file_access(),
                                 errmsg("could not fsync file \"%s\": %m",
                                        path)));
                    else
                        ereport(DEBUG1,
                                (errcode_for_file_access(),
                                 errmsg("could not fsync file \"%s\" but retrying: %m",
                                        path)));

I might be all wet here, but it seems like if we enter the bottom
branch, we still need the promote-to-panic behavior.

That case is only reached if FILE_POSSIBLY_DELETED() holds on the first
time through the loop, and it detects an errno value not actually from
fsync(). It's from FileSync(), when it tries to reopen a virtual fd
and gets ENOENT, before calling fsync(). Code further down then
absorbs incoming requests before checking if that was expected,
closing a race. The comments could make that clearer, admittedly.

7. The comment adjustment for SyncDataDirectory mentions an
"important" fact about fsync behavior, but then doesn't seem to change
any logic on that basis. I think in general a number of these
comments need a little more thought, but in this particular case, I
think we also need to consider what the behavior should be (and the
comment should reflect our considered judgement on that point, and the
implementation should match).

I updated the comment. I don't think this is too relevant to the
fsync() failure case, because we'll be rewriting all changes from the
WAL again during recovery; I think this function is mostly useful for
switching from fsync = off to fsync = on and restarting, not coping
with previous fsync() failures by retrying (which we know to be
useless anyway). Someone could argue that if you restarted after
changing fsync from off to on, then this may be the first time you
learn that write-back failed, and then you're somewhat screwed whether
we panic or not, but I don't see any solution to that. Don't run
databases with fsync = off.

8. Andres suggested to me off-list that we should have a GUC to
disable the promote-to-panic behavior in case it turns out to be a
show-stopper for some user. I think that's probably a good idea.
Adding many new ways to PANIC in a minor release without providing any
way to go back to the old behavior sounds unfriendly. Surely, anyone
who suffers much from this has really serious other problems anyway,
but all the same I think we should provide an escape hatch.

+1. See the new GUC data_sync_retry, defaulting to false. If set to
true, we also need to fix the problem reported in [1], so here's the
patch for that too.

Other comments:

I don't see why sync_file_range(SYNC_FILE_RANGE_WRITE) should get a
pass here. Inspection of some version of the kernel might tell us it
can't advance the error counter and report failure, but what do we
gain by relying on that? Changed.

FD_DELETE_AT_CLOSE is not a good way to detect temporary files in
recent versions, as it doesn't detect the kind of shared temporary
files used by Parallel Hash; FD_TEMP_FILE_LIMIT is a better way.
Changed. (We could also just not bother exempting temporary files?)

I plan to continue working on the fd-transfer system as part of a
larger sync queue redesign project for 12. If we can get an agreement
that we can't possibly back-patch the fd-transfer logic, then we can
move all future discussion of that topic over to the other thread [2],
and this thread can be about consensus to back-patch the PANIC patch.
Thoughts?

[1]: /messages/by-id/87y3i1ia4w.fsf@news-spur.riddles.org.uk
[2]: /messages/by-id/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
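For readers unfamiliar with the fd-transfer mechanism discussed above: the patch's pg_uds_send_with_fd()/pg_uds_recv_with_fd() pass file descriptors between processes as SCM_RIGHTS ancillary data on a Unix domain socket. A minimal standalone sketch of that technique (not the patch's actual implementation, which also carries a request struct and handles EINTR/EAGAIN):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one byte of payload plus a duplicated file descriptor. */
static ssize_t
send_with_fd(int sock, int fd)
{
	char		data = 'x';
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0);
}

/* Receive the payload byte and return the transferred fd, or -1. */
static int
recv_with_fd(int sock)
{
	char		data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int			fd = -1;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg != NULL && cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The key property being relied on in this thread is that the receiving process's fd refers to the same open file description, so (on Linux 4.13+) it observes writeback errors reported against that file.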

Attachments:

0001-Don-t-forget-about-failed-fsync-requests-v4.patch (application/octet-stream)
From 42d19801ddfd4d53c9c8a95ea3d761c14fafe53a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 6 Apr 2018 11:54:17 +1200
Subject: [PATCH 1/2] Don't forget about failed fsync() requests.

If fsync() fails, the storage manager mustn't forget the fsync request
so that future attempts will try again.

Back-patch to all supported releases.

Author: Thomas Munro
Reviewed-By: Amit Kapila
Reported-By: Andrew Gierth
Discussion: https://postgr.es/m/87y3i1ia4w.fsf%40news-spur.riddles.org.uk
---
 src/backend/storage/smgr/md.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..d1a2a3d5b65 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1150,10 +1150,8 @@ mdsync(void)
 		 * The bitmap manipulations are slightly tricky, because we can call
 		 * AbsorbFsyncRequests() inside the loop and that could result in
 		 * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
-		 * This is okay because we unlink each bitmapset from the hashtable
-		 * entry before scanning it.  That means that any incoming fsync
-		 * requests will be processed now if they reach the table before we
-		 * begin to scan their fork.
+		 * So we detach it, but if we fail we'll merge it with any new
+		 * requests that have arrived in the meantime.
 		 */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
@@ -1163,7 +1161,8 @@ mdsync(void)
 			entry->requests[forknum] = NULL;
 			entry->canceled[forknum] = false;
 
-			while ((segno = bms_first_member(requests)) >= 0)
+			segno = -1;
+			while ((segno = bms_next_member(requests, segno)) >= 0)
 			{
 				int			failures;
 
@@ -1244,6 +1243,7 @@ mdsync(void)
 							longest = elapsed;
 						total_elapsed += elapsed;
 						processed++;
+						requests = bms_del_member(requests, segno);
 						if (log_checkpoints)
 							elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
 								 processed,
@@ -1272,10 +1272,23 @@ mdsync(void)
 					 */
 					if (!FILE_POSSIBLY_DELETED(errno) ||
 						failures > 0)
+					{
+						Bitmapset *new_requests;
+
+						/*
+						 * We need to merge these unsatisfied requests with
+						 * any others that have arrived since we started.
+						 */
+						new_requests = entry->requests[forknum];
+						entry->requests[forknum] =
+							bms_join(new_requests, requests);
+
+						errno = save_errno;
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\": %m",
 										path)));
+					}
 					else
 						ereport(DEBUG1,
 								(errcode_for_file_access(),
-- 
2.17.1 (Apple Git-112)

0002-PANIC-on-fsync-failure-v4.patch (application/octet-stream)
From 3d3f83c50c6d42d46a157ebf58325a7be32e0e2e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 17 Oct 2018 14:45:43 +1300
Subject: [PATCH 2/2] PANIC on fsync() failure.

On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
failure.  In that case the only remaining copy of the data is in
the WAL.  A subsequent fsync() could appear to succeed, but not
have flushed the data that was dropped.  That means that a future
checkpoint could apparently complete successfully, allowing the
WAL to be deleted.

Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure.  Note that we already
did the same for WAL data; this change refers to non-temporary
data files.

Provide a GUC data_sync_retry to disable this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses.  If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed; if the
error is permanent, later checkpoints will continue to fail.  The
GUC defaults to off, meaning that we panic.

Back-patch to all supported releases.

There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error.  A later patch will address that by keeping files
with dirty data open at all times by passing fds between processes,
but for now we judge that to be too complicated to back-patch.

Author: Craig Ringer, some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
---
 doc/src/sgml/config.sgml                      | 32 ++++++++++++
 src/backend/access/heap/rewriteheap.c         |  6 +--
 src/backend/access/transam/slru.c             |  2 +-
 src/backend/access/transam/timeline.c         |  4 +-
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/replication/logical/snapbuild.c   |  3 ++
 src/backend/storage/file/fd.c                 | 51 ++++++++++++++++---
 src/backend/storage/smgr/md.c                 |  6 +--
 src/backend/utils/cache/relmapper.c           |  2 +-
 src/backend/utils/misc/guc.c                  |  9 ++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/storage/fd.h                      |  2 +
 12 files changed, 101 insertions(+), 19 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7554cba3f96..292fdf9d1ee 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8161,6 +8161,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-data-sync-retry" xreflabel="data_sync_retry">
+      <term><varname>data_sync_retry</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>data_sync_retry</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to false, which is the default, <productname>PostgreSQL</productname>
+        will raise a PANIC-level error on failure to flush modified data files
+        to the filesystem.  This causes the database server to crash.
+       </para>
+       <para>
+        On some operating systems, the status of data in the kernel's page
+        cache is unknown after a write-back failure.  In some cases it might
+        have been entirely forgotten, making it unsafe to retry; the second
+        attempt may be reported as successful, when in fact the data has been
+        lost.  In these circumstances, the only way to avoid data loss is to
+        recover from the WAL after any failure is reported, preferrably
+        after investigating the root cause of the failure and replacing any
+        faulty hardware.
+       </para>
+       <para>
+        If set to true, <productname>PostgreSQL</productname> will instead
+        report an error but continue to run so that the data flushing
+        operation can be retried in a later checkpoint.  Only set it to true
+        after investigating the operating system's treatment of buffered data
+        in case of write-back failure.
+       </para>
+      </listitem>
+     </varlistentry>
+
     </variablelist>
 
    </sect1>
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 71277889649..36139600377 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -978,7 +978,7 @@ logical_end_heap_rewrite(RewriteState state)
 	while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
 	{
 		if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
-			ereport(ERROR,
+			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m", src->path)));
 		FileClose(src->vfd);
@@ -1199,7 +1199,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", path)));
 	pgstat_report_wait_end();
@@ -1298,7 +1298,7 @@ CheckPointLogicalRewriteHeap(void)
 			 */
 			pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC);
 			if (pg_fsync(fd) != 0)
-				ereport(ERROR,
+				ereport(data_sync_elevel(ERROR),
 						(errcode_for_file_access(),
 						 errmsg("could not fsync file \"%s\": %m", path)));
 			pgstat_report_wait_end();
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 1132eef0384..fad5d363e32 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -928,7 +928,7 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid)
 							   path, offset)));
 			break;
 		case SLRU_FSYNC_FAILED:
-			ereport(ERROR,
+			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not access status of transaction %u", xid),
 					 errdetail("Could not fsync file \"%s\": %m.",
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 61d36050c34..70eec5676eb 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -406,7 +406,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 
 	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
@@ -485,7 +485,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 
 	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7375a78ffcf..d6cd49ef460 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3472,7 +3472,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6cd6c67d16..363ddf4505e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1629,6 +1629,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	 * fsync the file before renaming so that even if we crash after this we
 	 * have either a fully valid file or nothing.
 	 *
+	 * It's safe to just ERROR on fsync() here because we'll retry the whole
+	 * operation including the writes.
+	 *
 	 * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
 	 * some noticeable overhead since it's performed synchronously during
 	 * decoding?
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..361b7d09e6a 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -145,6 +145,8 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = 32;	/* default if not changed */
 
+/* Whether it is safe to continue running after fsync() fails. */
+bool		data_sync_retry = false;
 
 /* Debugging.... */
 
@@ -442,11 +444,9 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 		 */
 		rc = sync_file_range(fd, offset, nbytes,
 							 SYNC_FILE_RANGE_WRITE);
-
-		/* don't error out, this is just a performance optimization */
 		if (rc != 0)
 		{
-			ereport(WARNING,
+			ereport(data_sync_elevel(WARNING),
 					(errcode_for_file_access(),
 					 errmsg("could not flush dirty data: %m")));
 		}
@@ -518,7 +518,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 			rc = msync(p, (size_t) nbytes, MS_ASYNC);
 			if (rc != 0)
 			{
-				ereport(WARNING,
+				ereport(data_sync_elevel(WARNING),
 						(errcode_for_file_access(),
 						 errmsg("could not flush dirty data: %m")));
 				/* NB: need to fall through to munmap()! */
@@ -1046,11 +1046,13 @@ LruDelete(File file)
 	}
 
 	/*
-	 * Close the file.  We aren't expecting this to fail; if it does, better
-	 * to leak the FD than to mess up our internal state.
+	 * Close the file.  We aren't expecting this to fail, but there are some
+	 * filesystems that are capable of reporting write-back failures on close.
+	 * Otherwise, better to leak the FD than to mess up our internal state.
 	 */
 	if (close(vfdP->fd))
-		elog(LOG, "could not close file \"%s\": %m", vfdP->fileName);
+		elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
+			 "could not close file \"%s\": %m", vfdP->fileName);
 	vfdP->fd = VFD_CLOSED;
 	--nfile;
 
@@ -1754,7 +1756,14 @@ FileClose(File file)
 	{
 		/* close the file */
 		if (close(vfdP->fd))
-			elog(LOG, "could not close file \"%s\": %m", vfdP->fileName);
+		{
+			/*
+			 * We may need to panic on failure to close non-temporary files;
+			 * see LruDelete.
+			 */
+			elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
+				"could not close file \"%s\": %m", vfdP->fileName);
+		}
 
 		--nfile;
 		vfdP->fd = VFD_CLOSED;
@@ -3250,6 +3259,9 @@ looks_like_temp_rel_name(const char *name)
  * harmless cases such as read-only files in the data directory, and that's
  * not good either.
  *
+ * Note that if we previously crashed due to a PANIC on fsync(), we'll be
+ * rewriting all changes again during recovery.
+ *
  * Note we assume we're chdir'd into PGDATA to begin with.
  */
 void
@@ -3572,3 +3584,26 @@ MakePGDirectory(const char *directoryName)
 {
 	return mkdir(directoryName, pg_dir_create_mode);
 }
+
+/*
+ * Return the passed-in error level, or PANIC if data_sync_retry is off.
+ *
+ * Failure to fsync any data file is cause for immediate panic, unless
+ * data_sync_retry is enabled.  Data may have been written to the operating
+ * system and removed from our buffer pool already, and if we are running on
+ * an operating system that forgets dirty data on write-back failure, there
+ * may be only one copy of the data remaining: in the WAL.  A later attempt to
+ * fsync again might falsely report success.  Therefore we must not allow any
+ * further checkpoints to be attempted.  data_sync_retry can in theory be
+ * enabled on systems known not to drop dirty buffered data on write-back
+ * failure (with the likely outcome that checkpoints will continue to fail
+ * until the underlying problem is fixed).
+ *
+ * Any code that reports a failure from fsync() or related functions should
+ * filter the error level with this function.
+ */
+int
+data_sync_elevel(int elevel)
+{
+	return data_sync_retry ? elevel : PANIC;
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d1a2a3d5b65..94e56208d35 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1039,7 +1039,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
 
 		if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
-			ereport(ERROR,
+			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(v->mdfd_vfd))));
@@ -1284,7 +1284,7 @@ mdsync(void)
 							bms_join(new_requests, requests);
 
 						errno = save_errno;
-						ereport(ERROR,
+						ereport(data_sync_elevel(ERROR),
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\": %m",
 										path)));
@@ -1458,7 +1458,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 				(errmsg("could not forward fsync request because request queue is full")));
 
 		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(ERROR,
+			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(seg->mdfd_vfd))));
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 905867dc767..328d4aae7b7 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -876,7 +876,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m",
 						mapfilename)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2317e8be6be..a50dd6e8e96 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1831,6 +1831,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"data_sync_retry", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Whether to continue running after a failure to sync data files."),
+		},
+		&data_sync_retry,
+		false,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4e61bc6521f..0cb76b68b44 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -667,6 +667,7 @@
 
 #exit_on_error = off			# terminate session on any error?
 #restart_after_crash = on		# reinitialize after backend crash?
+#data_sync_retry = off			# retry or panic on failure to fsync data?
 
 
 #------------------------------------------------------------------------------
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..4b9d1312c26 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -51,6 +51,7 @@ typedef int File;
 
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
+extern PGDLLIMPORT bool data_sync_retry;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
@@ -138,6 +139,7 @@ extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel
 extern int	durable_unlink(const char *fname, int loglevel);
 extern int	durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
 extern void SyncDataDirectory(void);
+extern int data_sync_elevel(int elevel);
 
 /* Filename components */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
-- 
2.17.1 (Apple Git-112)

#65Craig Ringer
craig@2ndquadrant.com
In reply to: Thomas Munro (#64)
Re: Postgres, fsync, and OSs (specifically linux)

On Fri, 19 Oct 2018 at 07:27, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

2. I am +1 on back-patching Craig's PANIC-on-failure logic. Doing
nothing is not an option I like. I have some feedback and changes to
propose though; see attached.

Thanks very much for the work on reviewing and revising this.

I don't see why sync_file_range(SYNC_FILE_RANGE_WRITE) should get a
pass here. Inspection of some version of the kernel might tell us it
can't advance the error counter and report failure, but what do we
gain by relying on that? Changed.

I was sure it made sense at the time, but I can't explain that decision
now, and it looks like we should treat it as a failure.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#66Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Craig Ringer (#65)
Re: Postgres, fsync, and OSs (specifically linux)

On Fri, Oct 19, 2018 at 6:42 PM Craig Ringer <craig@2ndquadrant.com> wrote:

On Fri, 19 Oct 2018 at 07:27, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

2. I am +1 on back-patching Craig's PANIC-on-failure logic. Doing
nothing is not an option I like. I have some feedback and changes to
propose though; see attached.

Thanks very much for the work on reviewing and revising this.

My plan is to do a round of testing and review of this stuff next week
once the dust is settled on the current minor releases (including
fixing a few typos I just spotted and some word-smithing). All going
well, I will then push the resulting patches to master and all
supported stable branches, unless other reviews or objections appear.
At some point not too far down the track I hope to be ready to
consider committing that other patch that will completely change all
of this code in the master branch, but in any case Craig's patch will
get almost a full minor release cycle to sit in the stable branches
before release.

--
Thomas Munro
http://www.enterprisedb.com

#67Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#66)
Re: Postgres, fsync, and OSs (specifically linux)

On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

My plan is to do a round of testing and review of this stuff next week
once the dust is settled on the current minor releases (including
fixing a few typos I just spotted and some word-smithing). All going
well, I will then push the resulting patches to master and all
supported stable branches, unless other reviews or objections appear.
At some point not too far down the track I hope to be ready to
consider committing that other patch that will completely change all
of this code in the master branch, but in any case Craig's patch will
get almost a full minor release cycle to sit in the stable branches
before release.

I did a read-through of these patches.

+ new_requests = entry->requests[forknum];
+ entry->requests[forknum] =
+ bms_join(new_requests, requests);

What happens if bms_join fails, too?

+ recover from the WAL after any failure is reported, preferrably

preferably.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#68Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#67)
Re: Postgres, fsync, and OSs (specifically linux)

On Fri, Nov 9, 2018 at 7:07 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

My plan is to do a round of testing and review of this stuff next week
once the dust is settled on the current minor releases (including
fixing a few typos I just spotted and some word-smithing). All going
well, I will then push the resulting patches to master and all
supported stable branches, unless other reviews or objections appear.
At some point not too far down the track I hope to be ready to
consider committing that other patch that will completely change all
of this code in the master branch, but in any case Craig's patch will
get almost a full minor release cycle to sit in the stable branches
before release.

I did a read-through of these patches.

+ new_requests = entry->requests[forknum];
+ entry->requests[forknum] =
+ bms_join(new_requests, requests);

What happens if bms_join fails, too?

My reasoning for choosing bms_join() is that it cannot fail, assuming
the heap is not corrupted. It simply ORs the two bit-strings into
whichever is the longer input string, and frees the shorter input
string. (In an earlier version I used bms_union(), this function's
non-destructive sibling, but then realised that it could fail to
allocate memory, causing us to lose track of a 1 bit).

Philosophical point: if pfree() throws, then bms_join() throws, but
(assuming AllocSetFree() implementation) it can only throw if the heap
is corrupted, eg elog(ERROR, "could not find block containing chunk
%p", chunk) and possibly other errors. Of course it's impossible to
make guarantees of any kind in case of arbitrary corruption. But
perhaps we could do this in a critical section, so errors are promoted
to PANIC.

+ recover from the WAL after any failure is reported, preferrably

preferably.

Thanks.

--
Thomas Munro
http://www.enterprisedb.com

#69Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#68)
Re: Postgres, fsync, and OSs (specifically linux)

On Thu, Nov 8, 2018 at 3:04 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

My reasoning for choosing bms_join() is that it cannot fail, assuming
the heap is not corrupted. It simply ORs the two bit-strings into
whichever is the longer input string, and frees the shorter input
string. (In an earlier version I used bms_union(), this function's
non-destructive sibling, but then realised that it could fail to
allocate() causing us to lose track of a 1 bit).

Oh, OK. I was assuming it was allocating.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#70Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#69)
3 attachment(s)
Re: Postgres, fsync, and OSs (specifically linux)

On Fri, Nov 9, 2018 at 9:06 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 8, 2018 at 3:04 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

My reasoning for choosing bms_join() is that it cannot fail, assuming
the heap is not corrupted. It simply ORs the two bit-strings into
whichever is the longer input string, and frees the shorter input
string. (In an earlier version I used bms_union(), this function's
non-destructive sibling, but then realised that it could fail to
allocate() causing us to lose track of a 1 bit).

Oh, OK. I was assuming it was allocating.

I did some more testing using throw-away fault injection patch 0003.
I found one extra problem: fsync_fname() needed data_sync_elevel()
treatment, because it is used in eg CheckPointCLOG().

With data_sync_retry = on, if you update a row, touch
/tmp/FileSync_EIO, and try to checkpoint, then the checkpoint fails, and
the cluster keeps running. Future checkpoint attempts report the same
error about the same file, showing that patch 0001 works (we didn't
forget about the dirty file). Then rm /tmp/FileSync_EIO, and the next
checkpoint should succeed.

With data_sync_retry = off (the default), the same test produces a
PANIC, showing that patch 0002 works.

It's similar if you touch /tmp/pg_sync_EIO instead. That shows that
cases like fsync_fname("pg_xact") also cause PANIC when
data_sync_retry = off, but it hides the bug that 0001 fixes when
data_sync_retry = on, hence my desire to test the two different fault
injection points.

I think these patches are looking good now. If I don't spot any other
problems or hear any objections, I will commit them tomorrow-ish.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Don-t-forget-about-failed-fsync-requests-v5.patch (application/octet-stream)
From 582b0807c35b9657376b68212f8a329b0f117507 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 6 Apr 2018 11:54:17 +1200
Subject: [PATCH 1/3] Don't forget about failed fsync() requests.

If fsync() fails, the storage manager mustn't forget the fsync request
so that future attempts will try again.

Back-patch to all supported releases.

Author: Thomas Munro
Reviewed-By: Amit Kapila
Reported-By: Andrew Gierth
Discussion: https://postgr.es/m/87y3i1ia4w.fsf%40news-spur.riddles.org.uk
---
 src/backend/storage/smgr/md.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 86013a5c8b2..ec7fc322546 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1123,10 +1123,8 @@ mdsync(void)
 		 * The bitmap manipulations are slightly tricky, because we can call
 		 * AbsorbFsyncRequests() inside the loop and that could result in
 		 * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
-		 * This is okay because we unlink each bitmapset from the hashtable
-		 * entry before scanning it.  That means that any incoming fsync
-		 * requests will be processed now if they reach the table before we
-		 * begin to scan their fork.
+		 * So we detach it, but if we fail we'll merge it with any new
+		 * requests that have arrived in the meantime.
 		 */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		{
@@ -1136,7 +1134,8 @@ mdsync(void)
 			entry->requests[forknum] = NULL;
 			entry->canceled[forknum] = false;
 
-			while ((segno = bms_first_member(requests)) >= 0)
+			segno = -1;
+			while ((segno = bms_next_member(requests, segno)) >= 0)
 			{
 				int			failures;
 
@@ -1217,6 +1216,7 @@ mdsync(void)
 							longest = elapsed;
 						total_elapsed += elapsed;
 						processed++;
+						requests = bms_del_member(requests, segno);
 						if (log_checkpoints)
 							elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
 								 processed,
@@ -1245,10 +1245,23 @@ mdsync(void)
 					 */
 					if (!FILE_POSSIBLY_DELETED(errno) ||
 						failures > 0)
+					{
+						Bitmapset *new_requests;
+
+						/*
+						 * We need to merge these unsatisfied requests with
+						 * any others that have arrived since we started.
+						 */
+						new_requests = entry->requests[forknum];
+						entry->requests[forknum] =
+							bms_join(new_requests, requests);
+
+						errno = save_errno;
 						ereport(ERROR,
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\": %m",
 										path)));
+					}
 					else
 						ereport(DEBUG1,
 								(errcode_for_file_access(),
-- 
2.19.1

0002-PANIC-on-fsync-failure-v5.patch (application/octet-stream)
From f289980a9a0684fce758dcba599272fa5d6ed0ed Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 17 Oct 2018 14:45:43 +1300
Subject: [PATCH 2/3] PANIC on fsync() failure.

On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
failure.  In that case the only remaining copy of the data is in
the WAL.  A subsequent fsync() could appear to succeed, but not
have flushed the data that was dropped.  That means that a future
checkpoint could apparently complete successfully, allowing the
WAL to be deleted.

Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure.  Note that we already
did the same for WAL data; this change extends that behavior to
non-temporary data files.

Provide a GUC data_sync_retry to disable this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses.  If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed (on a
system that is known not to throw away buffers on failure); if the
error is permanent, later checkpoints will continue to fail.  The
GUC defaults to off, meaning that we panic.

Back-patch to all supported releases.

There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error.  A later patch will address that with a scheme
for keeping files with dirty data open at all times, but we judge
that to be too complicated a change to back-patch.

Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
---
 doc/src/sgml/config.sgml                      | 32 +++++++++++++
 src/backend/access/heap/rewriteheap.c         |  6 +--
 src/backend/access/transam/slru.c             |  2 +-
 src/backend/access/transam/timeline.c         |  4 +-
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/replication/logical/snapbuild.c   |  3 ++
 src/backend/storage/file/fd.c                 | 48 ++++++++++++++++---
 src/backend/storage/smgr/md.c                 |  6 +--
 src/backend/utils/cache/relmapper.c           |  2 +-
 src/backend/utils/misc/guc.c                  |  9 ++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/storage/fd.h                      |  2 +
 12 files changed, 99 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0f8f2ef920d..c4effa034c1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8161,6 +8161,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-data-sync-retry" xreflabel="data_sync_retry">
+      <term><varname>data_sync_retry</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>data_sync_retry</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to false, which is the default, <productname>PostgreSQL</productname>
+        will raise a PANIC-level error on failure to flush modified data files
+        to the filesystem.  This causes the database server to crash.
+       </para>
+       <para>
+        On some operating systems, the status of data in the kernel's page
+        cache is unknown after a write-back failure.  In some cases it might
+        have been entirely forgotten, making it unsafe to retry; the second
+        attempt may be reported as successful, when in fact the data has been
+        lost.  In these circumstances, the only way to avoid data loss is to
+        recover from the WAL after any failure is reported, preferably
+        after investigating the root cause of the failure and replacing any
+        faulty hardware.
+       </para>
+       <para>
+        If set to true, <productname>PostgreSQL</productname> will instead
+        report an error but continue to run so that the data flushing
+        operation can be retried in a later checkpoint.  Only set it to true
+        after investigating the operating system's treatment of buffered data
+        in case of write-back failure.
+       </para>
+      </listitem>
+     </varlistentry>
+
     </variablelist>
 
    </sect1>
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c5db75afa1f..d5bd282f8c7 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -978,7 +978,7 @@ logical_end_heap_rewrite(RewriteState state)
 	while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
 	{
 		if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
-			ereport(ERROR,
+			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m", src->path)));
 		FileClose(src->vfd);
@@ -1199,7 +1199,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", path)));
 	pgstat_report_wait_end();
@@ -1298,7 +1298,7 @@ CheckPointLogicalRewriteHeap(void)
 			 */
 			pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC);
 			if (pg_fsync(fd) != 0)
-				ereport(ERROR,
+				ereport(data_sync_elevel(ERROR),
 						(errcode_for_file_access(),
 						 errmsg("could not fsync file \"%s\": %m", path)));
 			pgstat_report_wait_end();
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 1132eef0384..fad5d363e32 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -928,7 +928,7 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid)
 							   path, offset)));
 			break;
 		case SLRU_FSYNC_FAILED:
-			ereport(ERROR,
+			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not access status of transaction %u", xid),
 					 errdetail("Could not fsync file \"%s\": %m.",
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 61d36050c34..70eec5676eb 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -406,7 +406,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 
 	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
@@ -485,7 +485,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 
 	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7eed5866d2e..fb3d9ee5303 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3455,7 +3455,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6cd6c67d16..363ddf4505e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1629,6 +1629,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	 * fsync the file before renaming so that even if we crash after this we
 	 * have either a fully valid file or nothing.
 	 *
+	 * It's safe to just ERROR on fsync() here because we'll retry the whole
+	 * operation including the writes.
+	 *
 	 * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
 	 * some noticeable overhead since it's performed synchronously during
 	 * decoding?
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 2d75773ef02..827a1e2620b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -145,6 +145,8 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = 32;	/* default if not changed */
 
+/* Whether it is safe to continue running after fsync() fails. */
+bool		data_sync_retry = false;
 
 /* Debugging.... */
 
@@ -430,11 +432,9 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 		 */
 		rc = sync_file_range(fd, offset, nbytes,
 							 SYNC_FILE_RANGE_WRITE);
-
-		/* don't error out, this is just a performance optimization */
 		if (rc != 0)
 		{
-			ereport(WARNING,
+			ereport(data_sync_elevel(WARNING),
 					(errcode_for_file_access(),
 					 errmsg("could not flush dirty data: %m")));
 		}
@@ -506,7 +506,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 			rc = msync(p, (size_t) nbytes, MS_ASYNC);
 			if (rc != 0)
 			{
-				ereport(WARNING,
+				ereport(data_sync_elevel(WARNING),
 						(errcode_for_file_access(),
 						 errmsg("could not flush dirty data: %m")));
 				/* NB: need to fall through to munmap()! */
@@ -562,7 +562,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 void
 fsync_fname(const char *fname, bool isdir)
 {
-	fsync_fname_ext(fname, isdir, false, ERROR);
+	fsync_fname_ext(fname, isdir, false, data_sync_elevel(ERROR));
 }
 
 /*
@@ -1022,7 +1022,8 @@ LruDelete(File file)
 	 * to leak the FD than to mess up our internal state.
 	 */
 	if (close(vfdP->fd))
-		elog(LOG, "could not close file \"%s\": %m", vfdP->fileName);
+		elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
+			 "could not close file \"%s\": %m", vfdP->fileName);
 	vfdP->fd = VFD_CLOSED;
 	--nfile;
 
@@ -1698,7 +1699,14 @@ FileClose(File file)
 	{
 		/* close the file */
 		if (close(vfdP->fd))
-			elog(LOG, "could not close file \"%s\": %m", vfdP->fileName);
+		{
+			/*
+			 * We may need to panic on failure to close non-temporary files;
+			 * see LruDelete.
+			 */
+			elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
+				"could not close file \"%s\": %m", vfdP->fileName);
+		}
 
 		--nfile;
 		vfdP->fd = VFD_CLOSED;
@@ -3091,6 +3099,9 @@ looks_like_temp_rel_name(const char *name)
  * harmless cases such as read-only files in the data directory, and that's
  * not good either.
  *
+ * Note that if we previously crashed due to a PANIC on fsync(), we'll be
+ * rewriting all changes again during recovery.
+ *
  * Note we assume we're chdir'd into PGDATA to begin with.
  */
 void
@@ -3413,3 +3424,26 @@ MakePGDirectory(const char *directoryName)
 {
 	return mkdir(directoryName, pg_dir_create_mode);
 }
+
+/*
+ * Return the passed-in error level, or PANIC if data_sync_retry is off.
+ *
+ * Failure to fsync any data file is cause for immediate panic, unless
+ * data_sync_retry is enabled.  Data may have been written to the operating
+ * system and removed from our buffer pool already, and if we are running on
+ * an operating system that forgets dirty data on write-back failure, there
+ * may be only one copy of the data remaining: in the WAL.  A later attempt to
+ * fsync again might falsely report success.  Therefore we must not allow any
+ * further checkpoints to be attempted.  data_sync_retry can in theory be
+ * enabled on systems known not to drop dirty buffered data on write-back
+ * failure (with the likely outcome that checkpoints will continue to fail
+ * until the underlying problem is fixed).
+ *
+ * Any code that reports a failure from fsync() or related functions should
+ * filter the error level with this function.
+ */
+int
+data_sync_elevel(int elevel)
+{
+	return data_sync_retry ? elevel : PANIC;
+}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index ec7fc322546..534dfb59c7a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1012,7 +1012,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
 
 		if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
-			ereport(ERROR,
+			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(v->mdfd_vfd))));
@@ -1257,7 +1257,7 @@ mdsync(void)
 							bms_join(new_requests, requests);
 
 						errno = save_errno;
-						ereport(ERROR,
+						ereport(data_sync_elevel(ERROR),
 								(errcode_for_file_access(),
 								 errmsg("could not fsync file \"%s\": %m",
 										path)));
@@ -1431,7 +1431,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 				(errmsg("could not forward fsync request because request queue is full")));
 
 		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(ERROR,
+			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(seg->mdfd_vfd))));
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 905867dc767..328d4aae7b7 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -876,7 +876,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m",
 						mapfilename)));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f9074215a2d..514595699be 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1830,6 +1830,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"data_sync_retry", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Whether to continue running after a failure to sync data files."),
+		},
+		&data_sync_retry,
+		false,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3fe257c53f1..ab063dae419 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -666,6 +666,7 @@
 
 #exit_on_error = off			# terminate session on any error?
 #restart_after_crash = on		# reinitialize after backend crash?
+#data_sync_retry = off			# retry or panic on failure to fsync data?
 
 
 #------------------------------------------------------------------------------
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 1289589a46b..cb882fb74e5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -47,6 +47,7 @@ typedef int File;
 
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
+extern PGDLLIMPORT bool data_sync_retry;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
@@ -134,6 +135,7 @@ extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel
 extern int	durable_unlink(const char *fname, int loglevel);
 extern int	durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
 extern void SyncDataDirectory(void);
+extern int data_sync_elevel(int elevel);
 
 /* Filename components */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
-- 
2.19.1

Attachment: 0003-fsync-fault-injection.-For-testing-only-v5.patch (application/octet-stream)
From 97511604cefae94982fcf8e7b5ec067d89167978 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Sun, 18 Nov 2018 14:23:47 +1300
Subject: [PATCH 3/3] fsync() fault injection.  For testing only!

Touch /tmp/pg_fsync_EIO and /tmp/FileSync_EIO to cause fake failures
for testing.
---
 src/backend/storage/file/fd.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 827a1e2620b..5a545f226e8 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -332,6 +332,12 @@ static int	fsync_parent_path(const char *fname, int elevel);
 int
 pg_fsync(int fd)
 {
+	if (access("/tmp/pg_fsync_EIO", F_OK) == 0)
+	{
+		errno = EIO;
+		return -1;
+	}
+
 	/* #if is to skip the sync_method test if there's no need for it */
 #if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
 	if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
@@ -1993,6 +1999,12 @@ FileSync(File file, uint32 wait_event_info)
 {
 	int			returnCode;
 
+	if (access("/tmp/FileSync_EIO", F_OK) == 0)
+	{
+		errno = EIO;
+		return -1;
+	}
+
 	Assert(FileIsValid(file));
 
 	DO_DB(elog(LOG, "FileSync: %d (%s)",
-- 
2.19.1

#71Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#70)
Re: Postgres, fsync, and OSs (specifically linux)

On Fri, Nov 9, 2018 at 9:03 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Nov 9, 2018 at 7:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> > > My plan is to do a round of testing and review of this stuff next week
> > > once the dust is settled on the current minor releases (including
> > > fixing a few typos I just spotted and some word-smithing).  All going
> > > well, I will then push the resulting patches to master and all
> > > supported stable branches, unless other reviews or objections appear.

...

On Sun, Nov 18, 2018 at 3:20 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I think these patches are looking good now.  If I don't spot any other
> problems or hear any objections, I will commit them tomorrow-ish.

Hearing no objections, pushed to all supported branches.

Thank you to Craig for all his work getting to the bottom of this, to
Andres for his open source diplomacy, and the Linux guys for their
change "errseq: Always report a writeback error once" which came out
of that.

Some more comments:

* The promotion of errors from close() to PANIC may or may not be
effective considering that it doesn't have interlocking with
concurrent checkpoints. I'm not sure if it can really happen on local
file systems anyway... this may fall under the category of "making
PostgreSQL work reliably on NFS", a configuration that is not
recommended currently, and a separate project IMV.

* In 9.4 and 9.5 there is no checking of errors from
sync_file_range(), and I didn't add any for now.  It was claimed that
sync_file_range() without BEFORE/AFTER can't consume errors[1].
Errors are promoted in 9.6+ for consistency because we already looked
at the return code, so we won't rely on that knowledge in the long
term.

* I personally believe it is safe to run with data_sync_retry = on on
any file system on FreeBSD, and ZFS on any operating system... but I
see no need to make recommendations about that in the documentation,
other than that you should investigate the behaviour of your operating
system if you really want to turn it on.

* A PANIC (and possibly ensuing crash restart loop if the I/O error is
not transient) is of course a very unpleasant failure mode, but it is
one that we already had for the WAL and control file. So I'm not sure
I'd personally bother to run with the non-default setting even on a
system where I believe it to be safe (considering the low likelihood
that I/O failure is isolated to certain files); at best it probably
gives you a better experience if the fs underneath a non-default
tablespace dies.

* The GUC is provided primarily because this patch is so drastic in
its effect that it seems like we owe our users a way to disable it on
principle, and that seems to outweigh a desire not to add GUCs in
back-branches.

* If I/O errors happen, your system is probably toast and you need to
fail over or restore from backups, but at least we won't tell you any
lies about checkpoints succeeding. In rare scenarios, perhaps
involving a transient failure of virtualised storage with thin
provisioning as originally described by Craig, the system may actually
be able to continue running, and with this change we should now be
able to avoid data loss by recovering from the WAL.

* As noted in the commit message, this isn't quite the end of the story.
See the fsync queue redesign thread[2], WIP for master only.

[1]: https://postgr.es/m/20180430160945.5s5qfoqryhtmugxl@alap3.anarazel.de
[2]: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com