fdatasync performance problem with large number of DB files
I initially posted this on the pgsql-general mailing list [5] but was
advised to also post this to the -hackers list as it deals with internals.
We've encountered a production performance problem with pg13 related to
how it fsyncs the whole data directory in certain scenarios, related to
what Paul (bcc'ed) described in a post to pgsql-hackers [1].
Background:
We've observed the full recursive fsync is triggered when
* pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse
or fsync_pgdata) unless --no-sync is specified
* postgres starts up unclean (via [3] SyncDataDirectory)
We run multiple postgres clusters and some of those clusters have many
(~450) databases (one database-per-customer) meaning that the postgres
data directory has around 700,000 files.
On one of our less loaded servers this takes ~7 minutes to complete, but
on another [4] this takes ~90 minutes.
Obviously this is an untenable risk. We've modified our process that
bootstraps a replica via pg_basebackup to instead do "pg_basebackup
--no-sync…" followed by a "sync", but we don't have any way to do the
equivalent for the postgres startup.
I presume the reason postgres doesn't blindly run a sync() is that we
don't know what other I/O is on the system and it'd be rude to affect
other services. That makes sense, except for our environment the work
done by the recursive fsync is orders of magnitude more disruptive than
a sync().
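(For readers following along, the recursive fsync in question is conceptually similar to the standalone sketch below. This is a simplification, not PostgreSQL's actual SyncDataDirectory, which also handles symlinks, tablespaces and pre-syncing; the helper name is made up for illustration.)

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <ftw.h>
#include <stdio.h>
#include <unistd.h>

static long nsynced;

/* Called for every entry in the tree: open and fsync each regular file. */
static int
sync_one(const char *path, const struct stat *sb, int typeflag,
         struct FTW *ftwbuf)
{
    if (typeflag == FTW_F)
    {
        int         fd = open(path, O_RDONLY);

        if (fd >= 0)
        {
            if (fsync(fd) == 0)
                nsynced++;
            close(fd);
        }
    }
    return 0;                   /* continue the walk */
}

/*
 * Recursively fsync every regular file under root, one open()/fsync()/
 * close() triple per file.  Returns the number of files synced, or -1
 * if the walk itself failed.  With ~700,000 files this is where the
 * minutes (or hours) go.
 */
long
fsync_tree(const char *root)
{
    nsynced = 0;
    if (nftw(root, sync_one, 64, FTW_PHYS) != 0)
        return -1;
    return nsynced;
}
```

Even when every file is already clean, each open()/fsync()/close() round trip costs metadata I/O, which is why the wall-clock time scales with file count rather than dirty data.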
My questions are:
* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?
Thanks!
[1]: /messages/by-id/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw@mail.gmail.com
[2]: https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181
[3]: https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495
[4]: It should be identical config-wise. It isn't starved for IO but does have other regular write workloads.
[5]: https://www.postgresql-archive.org/fdatasync-performance-problem-with-large-number-of-DB-files-td6184094.html
--
Michael Brown
Civilized Discourse Construction Kit, Inc.
https://www.discourse.org/
On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
<michael.brown@discourse.org> wrote:
* pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse
or fsync_pgdata) unless --no-sync is specified
* postgres starts up unclean (via [3] SyncDataDirectory)

We run multiple postgres clusters and some of those clusters have many
(~450) databases (one database-per-customer) meaning that the postgres
data directory has around 700,000 files.

On one of our less loaded servers this takes ~7 minutes to complete, but
on another [4] this takes ~90 minutes.
Ouch.
My questions are:
* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?
As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors). syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that. I mean, I even posted
one [1] in that other thread. There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).
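(For concreteness, a syncfs()-based data directory sync amounts to something like the sketch below. The helper name is hypothetical; syncfs() is Linux-only, and its error reporting before kernel 5.8 is limited, as discussed later in the thread.)

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush the whole filesystem containing 'path', instead of opening and
 * fsyncing every file on it individually.  Returns 0 on success, -1 on
 * error.  Note that before Linux 5.8, syncfs() reports no writeback
 * errors at all (only EBADF for a bad descriptor).
 */
int
sync_filesystem_of(const char *path)
{
    int         fd = open(path, O_RDONLY);
    int         ret;

    if (fd < 0)
        return -1;
    ret = syncfs(fd);
    close(fd);
    return ret;
}
```

With this approach the server would need only one call per filesystem boundary (the data directory, pg_wal if symlinked, and each tablespace), rather than one open()/fsync() per file.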
I also wrote a WAL-and-checkpoint based prototype [2], which, among
other advantages such as being faster, not ignoring errors and not
triggering collateral write-back storms, happens to work on all
operating systems. On the other hand it requires a somewhat dogmatic
switch in thinking about the meaning of checkpoints (I mean, it
requires humans to promise not to falsify checkpoints by copying
databases around underneath us), which may be hard to sell (I didn't
try very hard), and there may be subtleties I have missed...
[1]: /messages/by-id/CA+hUKGKT6XiPiEJrqeOFGi7RYCGzbBysF9pyWwv0-jm-oNajxg@mail.gmail.com
[2]: /messages/by-id/CA+hUKGKHhDNnN6fxf6qrAx9h+mjdNU2Zmx7ztJzFQ0C5=u3QPg@mail.gmail.com
On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
<michael.brown@discourse.org> wrote:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors). syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that. I mean, I even posted
one[1] in that other thread. There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).
Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
I've run into a couple of users who have just commented that recursive
fsync() code out!
I'd probably make it an enum-style GUC, because I intend to do some
more work on my "precise" alternative, though not in time for this
release, and it could just as well be an option too.
Thomas Munro <thomas.munro@gmail.com> writes:
Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
I've run into a couple of users who have just commented that recursive
fsync() code out!
I'm a little skeptical about the "simple" part. At minimum, you'd
have to syncfs() each tablespace, since we have no easy way to tell
which of them are on different filesystems. (Although, if we're
presuming this is Linux-only, we might be able to tell with some
unportable check or other.)
regards, tom lane
On Thu, Mar 11, 2021 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
I've run into a couple of users who have just commented that recursive
fsync() code out!

I'm a little skeptical about the "simple" part. At minimum, you'd
have to syncfs() each tablespace, since we have no easy way to tell
which of them are on different filesystems. (Although, if we're
presuming this is Linux-only, we might be able to tell with some
unportable check or other.)
Right, the patch knows about that:
+ /*
+ * On Linux, we don't have to open every single file one by one. We can
+ * use syncfs() to sync whole filesystems. We only expect filesystem
+ * boundaries to exist where we tolerate symlinks, namely pg_wal and the
+ * tablespaces, so we call syncfs() for each of those directories.
+ */
On 2021/03/11 8:30, Thomas Munro wrote:
On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
<michael.brown@discourse.org> wrote:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors). syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that. I mean, I even posted
one[1] in that other thread. There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).

Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
+1 to push this kind of change into v14!!
I've run into a couple of users who have just commented that recursive
fsync() code out!
BTW, we can skip that recursive fsync() by disabling the fsync GUC even without
commenting out the code?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Mar 11, 2021 at 2:00 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2021/03/11 8:30, Thomas Munro wrote:
I've run into a couple of users who have just commented that recursive
fsync() code out!

BTW, we can skip that recursive fsync() by disabling fsync GUC even without
commenting out the code?
Those users wanted fsync=on because they wanted to recover to a normal
online system after a crash, but they believed that the preceding
fsync of the data directory was useless, because replaying the WAL
should be enough. IMHO they were nearly on the right track, and the
prototype patch I linked earlier as [2] was my attempt to find the
specific reasons why that doesn't work and fix them. So far, I
figured out that you still have to remember to fsync the WAL files
(otherwise you're replaying WAL that potentially hasn't reached the
disk), and data files holding blocks that recovery decided to skip due
to BLK_DONE (otherwise you might decide to skip replay because of a
higher LSN that is on a page that is in the kernel's cache but not yet
on disk).
Thomas Munro <thomas.munro@gmail.com> writes:
On Thu, Mar 11, 2021 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm a little skeptical about the "simple" part. At minimum, you'd
have to syncfs() each tablespace, since we have no easy way to tell
which of them are on different filesystems. (Although, if we're
presuming this is Linux-only, we might be able to tell with some
unportable check or other.)
Right, the patch knows about that:
I noticed that the syncfs man page present in RHEL8 seemed a little
squishy on the critical question of error reporting. It promises
that syncfs will wait for I/O completion, but it doesn't say in so
many words that I/O errors will be reported (and the list of
applicable errno codes is only EBADF, not very reassuring).
Trolling the net, I found a newer-looking version of the man page,
and behold it says
In mainline kernel versions prior to 5.8, syncfs() will fail only
when passed a bad file descriptor (EBADF). Since Linux 5.8,
syncfs() will also report an error if one or more inodes failed
to be written back since the last syncfs() call.
So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.
regards, tom lane
On Thu, Mar 11, 2021 at 2:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Trolling the net, I found a newer-looking version of the man page,
and behold it says

In mainline kernel versions prior to 5.8, syncfs() will fail only
when passed a bad file descriptor (EBADF). Since Linux 5.8,
syncfs() will also report an error if one or more inodes failed
to be written back since the last syncfs() call.

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.
Agreed. Perhaps we could also try to do something programmatic about that.
Its fsync() was also pretty rough for the first 28 years.
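(One illustrative possibility for such a programmatic check, not part of any posted patch: parse the running kernel's release string from uname() and warn when it predates 5.8. Both function names here are made up for the sketch.)

```c
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Return 1 if a Linux release string such as "5.8.0-rc1-custom" is at
 * least major.minor, else 0.  Unparseable strings are treated as too old,
 * erring on the side of warning.
 */
int
kernel_at_least(const char *release, int major, int minor)
{
    int         r_major,
                r_minor;

    if (sscanf(release, "%d.%d", &r_major, &r_minor) != 2)
        return 0;
    return r_major > major || (r_major == major && r_minor >= minor);
}

/* Example: warn if the running kernel's syncfs() cannot report errors. */
void
warn_if_syncfs_unsafe(void)
{
    struct utsname u;

    if (uname(&u) == 0 && !kernel_at_least(u.release, 5, 8))
        fprintf(stderr,
                "WARNING: syncfs() does not report I/O errors before Linux 5.8\n");
}
```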
On Thu, Mar 11, 2021 at 2:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Mar 11, 2021 at 2:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Trolling the net, I found a newer-looking version of the man page,
and behold it says

In mainline kernel versions prior to 5.8, syncfs() will fail only
when passed a bad file descriptor (EBADF). Since Linux 5.8,
syncfs() will also report an error if one or more inodes failed
to be written back since the last syncfs() call.

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.

Agreed. Perhaps we could also try to do something programmatic about that.
Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.
I would plan to extend that to include a third option as already
discussed in the other thread, maybe something like "wal" (= sync WAL
files and then do extra analysis of WAL data to sync only data
modified since checkpoint but not replayed), but that'd be material
for PG15.
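(For reference, with this patch applied the setting would look like the following in postgresql.conf; the third "wal" value mentioned above is hypothetical future work and not in the patch.)

```
# Proposed GUC from the v2 patch in this thread (not in released PostgreSQL)
sync_after_crash = 'syncfs'    # 'fsync' (default) or 'syncfs' (Linux only)
```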
Attachments:
v2-0001-Optionally-use-syncfs-for-SyncDataDirectory-on-Li.patch (text/x-patch, +120 -3)
On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.
Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos.
Attachments:
v3-0001-Optionally-use-syncfs-for-SyncDataDirectory-on-Li.patch (text/x-patch, +121 -3)
On 2021/3/15, 7:34 AM, "Thomas Munro" <thomas.munro@gmail.com> wrote:
On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.
Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos.
By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.
[1]: /messages/by-id/25CFBDF2-5551-4CC3-ADEB-434B6B1BAD16@vmware.com
On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote:
By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.
As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up). It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough. Hmm. What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata? Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives. Or something like that...
I'm not too keen to invent clever new schemes for PG14, though. This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities. Is
that useful enough? In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.
I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.
On 2021/03/15 8:33, Thomas Munro wrote:
On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos.
Thanks for the patch!
+ When set to <literal>fsync</literal>, which is the default,
+ <productname>PostgreSQL</productname> will recursively open and fsync
+ all files in the data directory before crash recovery begins.
Isn't this a bit misleading? This may cause users to misunderstand that
such fsync can happen only in the case of crash recovery.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2021/03/16 8:15, Thomas Munro wrote:
On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote:
By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up). It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough. Hmm What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata? Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives. Or something like that...

I'm not too keen to invent clever new schemes for PG14, though. This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities. Is
that useful enough? In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.
I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.
They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be helpful for this case and simple.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, Mar 16, 2021 at 4:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2021/03/16 8:15, Thomas Munro wrote:
On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote:
By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up). It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough. Hmm What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
hostname, MAC address, or pgdata path (or e.g. the inode of a file?) might
be the same after vm cloning or directory copying, though that is not usual.
I cannot figure out a stable solution that makes the information go out of
date after vm/directory cloning/copying, so the simplest way seems to
be to leave the decision (i.e. touching a file) to users, instead of
writing the information automatically from pg_rewind/pg_basebackup.
(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata? Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives. Or something like that...

I'm not too keen to invent clever new schemes for PG14, though. This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities. Is
that useful enough? In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.
This scenario seems to support the file-touching solution, since
we do not have an automatic way to skip the fsync. I thought about using
sync_after_crash=none to fix my issue but, as I said, we would need to reset
the GUC since we still expect fsync/syncfs after the 2nd crash.
They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
This seems to make the process complex. From a product-design
perspective, that is not attractive.
sync_after_crash=none would be helpful for this case and simple.
Regards,
Paul Guo (Vmware)
On Tue, Mar 16, 2021 at 9:10 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for the patch!
+ When set to <literal>fsync</literal>, which is the default,
+ <productname>PostgreSQL</productname> will recursively open and fsync
+ all files in the data directory before crash recovery begins.

Isn't this a bit misleading? This may cause users to misunderstand that
such fsync can happen only in the case of crash recovery.
If I insert the following extra sentence after that one, is it better?
"This applies whenever starting a database cluster that did not shut
down cleanly, including copies created with pg_basebackup."
On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2021/03/16 8:15, Thomas Munro wrote:
I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.
Hopefully syncfs() will return quickly in that case, without doing any work?
They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be helpful for this case and simple.
I still do not like this footgun :-) However, perhaps I am being
overly dogmatic. Consider the change in d8179b00, which decided that
I/O errors in this phase should be reported at LOG level rather than
ERROR. In contrast, my "sync_after_crash=wal" mode (which I need to
rebase over this) will PANIC in this case, because any syncing will be
handled through the usual checkpoint codepaths.
Do you think it would be OK to commit this feature with just "fsync"
and "syncfs", and then to continue to consider adding "none" as a
possible separate commit?
On Wed, Mar 17, 2021 at 11:45 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2021/03/16 8:15, Thomas Munro wrote:
I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

Hopefully syncfs() will return quickly in that case, without doing any work?
I just quickly reviewed the patch (the code part). It looks good. My only
concern or question is about do_syncfs() passing a fd opened on a symlink
to syncfs() - I'm not 100% sure about that.
I think we could consider reviewing and then pushing the syncfs patch
now, since the fsync issue has a real impact; although there seems to be
a better plan for this in the future, this patch may make the sync step
during startup much faster. Today I encountered for the first time a real
case in a production environment: startup spends >1 hour on the fsync
step. I'm pretty sure that the pgdata is already synced, and per a pstack
of the startup process I saw it one by one slowly open(), fsync() (which
surely does nothing) and close() each file, and the pre_sync_fname() is
also a burden in such a scenario. So this issue is really annoying.
Even with the patch, we could discuss further optimizing the special
crash-recovery scenario where users explicitly know the sync step can be
skipped (this scenario is surely not unusual). The syncfs patch could be
used for this scenario too, but the filesystem might be shared by other
applications (this is not unusual and happens in my customer's
environment, for example), so syncfs() is (probably much) slower than
skipping the sync step entirely.
--
Paul Guo (Vmware)
On 2021/03/17 12:02, Thomas Munro wrote:
On Tue, Mar 16, 2021 at 9:10 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for the patch!
+ When set to <literal>fsync</literal>, which is the default,
+ <productname>PostgreSQL</productname> will recursively open and fsync
+ all files in the data directory before crash recovery begins.

Isn't this a bit misleading? This may cause users to misunderstand that
such fsync can happen only in the case of crash recovery.

If I insert the following extra sentence after that one, is it better?
"This applies whenever starting a database cluster that did not shut
down cleanly, including copies created with pg_basebackup."
Yes. Thanks!
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION