fdatasync performance problem with large number of DB files

Started by Michael Brownalmost 5 years ago49 messages

Michael Brown

michael.brown@discourse.org

almost 5 years ago

I initially posted this on the pgsql-general mailing list [5]https://www.postgresql-archive.org/fdatasync-performance-problem-with-large-number-of-DB-files-td6184094.html but was
advised to also post this to the -hackers list as it deals with internals.

We've encountered a production performance problem with pg13 related to
how it fsyncs the whole data directory in certain scenarios, related to
what Paul (bcc'ed) described in a post to pgsql-hackers [1]/messages/by-id/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw@mail.gmail.com.

Background:

We've observed the full recursive fsync is triggered when

* pg_basebackup receives a streaming backup (via [2]https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181 fsync_dir_recurse
or fsync_pgdata) unless --no-sync is specified
* postgres starts up unclean (via [3]https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495 SyncDataDirectory)

We run multiple postgres clusters and some of those clusters have many
(~450) databases (one database-per-customer) meaning that the postgres
data directory has around 700,000 files.

On one of our less loaded servers this takes ~7 minutes to complete, but
on another [4]It should be identical config-wise. It isn't starved for IO but does have other regular write workloads this takes ~90 minutes.

Obviously this is untenable risk. We've modified our process that
bootstraps a replica via pg_basebackup to instead do "pg_basebackup
--no-syncâ¦" followed by a "sync", but we don't have any way to do the
equivalent for the postgres startup.

I presume the reason postgres doesn't blindly run a sync() is that we
don't know what other I/O is on the system and it'd be rude to affect
other services. That makes sense, except for our environment the work
done by the recursive fsync is orders of magnitude more disruptive than
a sync().

My questions are:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

Thanks!

[1]: /messages/by-id/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw@mail.gmail.com
/messages/by-id/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw@mail.gmail.com
[2]: https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181
https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181
[3]: https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495
https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495
[4]: It should be identical config-wise. It isn't starved for IO but does have other regular write workloads
does have other regular write workloads
[5]: https://www.postgresql-archive.org/fdatasync-performance-problem-with-large-number-of-DB-files-td6184094.html
https://www.postgresql-archive.org/fdatasync-performance-problem-with-large-number-of-DB-files-td6184094.html

--
Michael Brown
Civilized Discourse Construction Kit, Inc.
https://www.discourse.org/

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Michael Brown (#1)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
<michael.brown@discourse.org> wrote:

* pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse
or fsync_pgdata) unless --no-sync is specified
* postgres starts up unclean (via [3] SyncDataDirectory)

We run multiple postgres clusters and some of those clusters have many
(~450) databases (one database-per-customer) meaning that the postgres
data directory has around 700,000 files.

On one of our less loaded servers this takes ~7 minutes to complete, but
on another [4] this takes ~90 minutes.

Ouch.

My questions are:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors). syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that. I mean, I even posted
one[1]/messages/by-id/CA+hUKGKT6XiPiEJrqeOFGi7RYCGzbBysF9pyWwv0-jm-oNajxg@mail.gmail.com in that other thread. There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).

I also wrote a WAL-and-checkpoint based prototype[2]/messages/by-id/CA+hUKGKHhDNnN6fxf6qrAx9h+mjdNU2Zmx7ztJzFQ0C5=u3QPg@mail.gmail.com, which, among
other advantages such as being faster, not ignoring errors and not
triggering collateral write-back storms, happens to work on all
operating systems. On the other hand it requires a somewhat dogmatic
switch in thinking about the meaning of checkpoints (I mean, it
requires humans to promise not to falsify checkpoints by copying
databases around underneath us), which may be hard to sell (I didn't
try very hard), and there may be subtleties I have missed...

[1]: /messages/by-id/CA+hUKGKT6XiPiEJrqeOFGi7RYCGzbBysF9pyWwv0-jm-oNajxg@mail.gmail.com
[2]: /messages/by-id/CA+hUKGKHhDNnN6fxf6qrAx9h+mjdNU2Zmx7ztJzFQ0C5=u3QPg@mail.gmail.com

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#2)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
<michael.brown@discourse.org> wrote:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors). syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that. I mean, I even posted
one[1] in that other thread. There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).

Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
I've run into a couple of users who have just commented that recursive
fsync() code out!

I'd probably make it an enum-style GUC, because I intend to do some
more work on my "precise" alternative, though not in time for this
release, and it could just as well be an option too.

Tom Lane

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Thomas Munro (#3)

Re: fdatasync performance problem with large number of DB files

Thomas Munro <thomas.munro@gmail.com> writes:

Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
I've run into a couple of users who have just commented that recursive
fsync() code out!

I'm a little skeptical about the "simple" part. At minimum, you'd
have to syncfs() each tablespace, since we have no easy way to tell
which of them are on different filesystems. (Although, if we're
presuming this is Linux-only, we might be able to tell with some
unportable check or other.)

regards, tom lane

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Tom Lane (#4)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
I've run into a couple of users who have just commented that recursive
fsync() code out!

I'm a little skeptical about the "simple" part. At minimum, you'd
have to syncfs() each tablespace, since we have no easy way to tell
which of them are on different filesystems. (Although, if we're
presuming this is Linux-only, we might be able to tell with some
unportable check or other.)

Right, the patch knows about that:

+    /*
+     * On Linux, we don't have to open every single file one by one.  We can
+     * use syncfs() to sync whole filesystems.  We only expect filesystem
+     * boundaries to exist where we tolerate symlinks, namely pg_wal and the
+     * tablespaces, so we call syncfs() for each of those directories.
+     */

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Thomas Munro (#3)

Re: fdatasync performance problem with large number of DB files

On 2021/03/11 8:30, Thomas Munro wrote:

On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
<michael.brown@discourse.org> wrote:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors). syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that. I mean, I even posted
one[1] in that other thread. There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).

Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.

+1 to push this kind of change into v14!!

I've run into a couple of users who have just commented that recursive
fsync() code out!

BTW, we can skip that recursive fsync() by disabling fsync GUC even without
commenting out the code?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Fujii Masao (#6)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 2:00 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/11 8:30, Thomas Munro wrote:

I've run into a couple of users who have just commented that recursive
fsync() code out!

BTW, we can skip that recursive fsync() by disabling fsync GUC even without
commenting out the code?

Those users wanted fsync=on because they wanted to recover to a normal
online system after a crash, but they believed that the preceding
fsync of the data directory was useless, because replaying the WAL
should be enough. IMHO they were nearly on the right track, and the
prototype patch I linked earlier as [2] was my attempt to find the
specific reasons why that doesn't work and fix them. So far, I
figured out that you still have to remember to fsync the WAL files
(otherwise you're replaying WAL that potentially hasn't reached the
disk), and data files holding blocks that recovery decided to skip due
to BLK_DONE (otherwise you might decide to skip replay because of a
higher LSN that is on a page that is in the kernel's cache but not yet
on disk).

Tom Lane

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Thomas Munro (#5)

Re: fdatasync performance problem with large number of DB files

Thomas Munro <thomas.munro@gmail.com> writes:

On Thu, Mar 11, 2021 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm a little skeptical about the "simple" part. At minimum, you'd
have to syncfs() each tablespace, since we have no easy way to tell
which of them are on different filesystems. (Although, if we're
presuming this is Linux-only, we might be able to tell with some
unportable check or other.)

Right, the patch knows about that:

I noticed that the syncfs man page present in RHEL8 seemed a little
squishy on the critical question of error reporting. It promises
that syncfs will wait for I/O completion, but it doesn't say in so
many words that I/O errors will be reported (and the list of
applicable errno codes is only EBADF, not very reassuring).

Trolling the net, I found a newer-looking version of the man page,
and behold it says

In mainline kernel versions prior to 5.8, syncfs() will fail only
when passed a bad file descriptor (EBADF). Since Linux 5.8,
syncfs() will also report an error if one or more inodes failed
to be written back since the last syncfs() call.

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.

regards, tom lane

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Tom Lane (#8)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 2:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Trolling the net, I found a newer-looking version of the man page,
and behold it says

In mainline kernel versions prior to 5.8, syncfs() will fail only
when passed a bad file descriptor (EBADF). Since Linux 5.8,
syncfs() will also report an error if one or more inodes failed
to be written back since the last syncfs() call.

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.

Agreed. Perhaps we could also try to do something programmatic about that.

Its fsync() was also pretty rough for the first 28 years.

#10

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Thomas Munro (#9)

1 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 2:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Mar 11, 2021 at 2:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Trolling the net, I found a newer-looking version of the man page,
and behold it says

In mainline kernel versions prior to 5.8, syncfs() will fail only
when passed a bad file descriptor (EBADF). Since Linux 5.8,
syncfs() will also report an error if one or more inodes failed
to be written back since the last syncfs() call.

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.

Agreed. Perhaps we could also try to do something programmatic about that.

Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.

I would plan to extend that to include a third option as already
discussed in the other thread, maybe something like "wal" (= sync WAL
files and then do extra analysis of WAL data to sync only data
modified since checkpoint but not replayed), but that'd be material
for PG15.

#11

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Thomas Munro (#10)

1 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.

Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos.

#12

Paul Guo

guopa@vmware.com

over 4 years ago

In reply to: Thomas Munro (#11)

Re: fdatasync performance problem with large number of DB files

> On 2021/3/15, 7:34 AM, "Thomas Munro" <thomas.munro@gmail.com> wrote:

On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.

Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos.

By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]/messages/by-id/25CFBDF2-5551-4CC3-ADEB-434B6B1BAD16@vmware.com), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

[1]: /messages/by-id/25CFBDF2-5551-4CC3-ADEB-434B6B1BAD16@vmware.com

#13

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Paul Guo (#12)

Re: fdatasync performance problem with large number of DB files

On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote:

By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up). It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough. Hmm What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata? Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives. Or something like that...

I'm not too keen to invent clever new schemes for PG14, though. This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities. Is
that useful enough? In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

#14

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Thomas Munro (#11)

Re: fdatasync performance problem with large number of DB files

On 2021/03/15 8:33, Thomas Munro wrote:

On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.

Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos.

Thanks for the patch!

+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.

Isn't this a bit misleading? This may cause users to misunderstand that
such fsync can happen only in the case of crash recovery.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#15

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Thomas Munro (#13)

Re: fdatasync performance problem with large number of DB files

On 2021/03/16 8:15, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote:

By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up). It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough. Hmm What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata? Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives. Or something like that...

I'm not too keen to invent clever new schemes for PG14, though. This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities. Is
that useful enough? In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be helpful for this case and simple.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#16

Paul Guo

paulguo@gmail.com

over 4 years ago

In reply to: Fujii Masao (#15)

Re: fdatasync performance problem with large number of DB files

On Tue, Mar 16, 2021 at 4:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/16 8:15, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote:

By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up). It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough. Hmm What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...

hostname, mac address, or pgdata path (or e.g. inode of a file?) might
be the same after vm cloning or directory copying though it is not usual.
I can not figure out a stable solution that makes the information is out of
date after vm/directory cloning/copying, so the simplest way seems to
be that leaves the decision (i.e. touching a file) to users, instead of
writing the information automatically by pg_rewind/pg_basebackup.

(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata? Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives. Or something like that...

I'm not too keen to invent clever new schemes for PG14, though. This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities. Is
that useful enough? In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

This scenario seems to be a support to the file touching solution since
we do not have an automatic solution to skip the fsync. I thought using
sync_after_crash=none to fix my issue but as I said we need to reset
the guc since we still expect fsync/syncfs after the 2nd crash.

They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting

This seems to make the process complex. From the perspective of product design,
this seems to be not attractive.

sync_after_crash=none would be helpful for this case and simple.

Regards,
Paul Guo (Vmware)

#17

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Fujii Masao (#14)

Re: fdatasync performance problem with large number of DB files

On Tue, Mar 16, 2021 at 9:10 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Thanks for the patch!
+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.
Isn't this a bit misleading? This may cause users to misunderstand that
such fsync can happen only in the case of crash recovery.

If I insert the following extra sentence after that one, is it better?
"This applies whenever starting a database cluster that did not shut
down cleanly, including copies created with pg_basebackup."

#18

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Fujii Masao (#15)

Re: fdatasync performance problem with large number of DB files

On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/16 8:15, Thomas Munro wrote:

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

Hopefully syncfs() will return quickly in that case, without doing any work?

They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be helpful for this case and simple.

I still do not like this footgun :-) However, perhaps I am being
overly dogmatic. Consider the change in d8179b00, which decided that
I/O errors in this phase should be reported at LOG level rather than
ERROR. In contrast, my "sync_after_crash=wal" mode (which I need to
rebase over this) will PANIC in this case, because any syncing will be
handled through the usual checkpoint codepaths.

Do you think it would be OK to commit this feature with just "fsync"
and "syncfs", and then to continue to consider adding "none" as a
possible separate commit?

#19

Paul Guo

paulguo@gmail.com

over 4 years ago

In reply to: Thomas Munro (#18)

Re: fdatasync performance problem with large number of DB files

On Wed, Mar 17, 2021 at 11:45 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/16 8:15, Thomas Munro wrote:

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

Hopefully syncfs() will return quickly in that case, without doing any work?

I just quickly reviewed the patch (the code part). It looks good. Only
one concern
or question is do_syncfs() for symlink opened fd for syncfs() - I'm
not 100% sure.

I think we could consider reviewing and then pushing the syncfs patch
at this moment since
the fsync issue really affects much although there seems to be a
better plan for this in the future,
it may make the sync step in startup much faster. Today I first
encountered a real
case in a production environment. startup spends >1hour on the fsync
step: I'm pretty
sure that the pgdata is sync-ed, and per startup pstack I saw the
startup process
one by one slowly open(), fsync() (surely do nothing) and close(), and
the pre_sync_fname() is also a burden in such a scenario. So this
issue is really
annoying.

We could discuss further optimizing the special crash recovery
scenario that users
explicitly know the sync step could be skipped (this scenario is
surely not unusual),
even having the patch. The syncfs patch could be used for this
scenario also but the
filesystem might be shared by other applications (this is not unusual
and happens in my
customer's environment for example) so syncfs() is (probably much) slower than
skipping the sync step.

--
Paul Guo (Vmware)

#20

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Thomas Munro (#17)

Re: fdatasync performance problem with large number of DB files

On 2021/03/17 12:02, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 9:10 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for the patch!
+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.
Isn't this a bit misleading? This may cause users to misunderstand that
such fsync can happen only in the case of crash recovery.
If I insert the following extra sentence after that one, is it better?
"This applies whenever starting a database cluster that did not shut
down cleanly, including copies created with pg_basebackup."

Yes. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#21

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Thomas Munro (#18)

Re: fdatasync performance problem with large number of DB files

On 2021/03/17 12:45, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/16 8:15, Thomas Munro wrote:

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

Hopefully syncfs() will return quickly in that case, without doing any work?

Yes, in Linux.

They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be helpful for this case and simple.

I still do not like this footgun :-) However, perhaps I am being
overly dogmatic. Consider the change in d8179b00, which decided that
I/O errors in this phase should be reported at LOG level rather than
ERROR. In contrast, my "sync_after_crash=wal" mode (which I need to
rebase over this) will PANIC in this case, because any syncing will be
handled through the usual checkpoint codepaths.

Do you think it would be OK to commit this feature with just "fsync"
and "syncfs", and then to continue to consider adding "none" as a
possible separate commit?

+1. "syncfs" feature is useful whether we also support "none" mode or not.
It's good idea to commit "syncfs" in advance.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#22

Paul Guo

guopa@vmware.com

over 4 years ago

In reply to: Fujii Masao (#21)

Re: fdatasync performance problem with large number of DB files

About the syncfs patch, my first impression on the guc name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you guys think
It is better to rename it to a clearer name like sync_method_after_crash or others?

#23

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Paul Guo (#22)

1 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:

About the syncfs patch, my first impression on the guc name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you guys think
It is better to rename it to a clearer name like sync_method_after_crash or others?

Works for me. Here is a new version like that, also including the
documentation change discussed with Fujii-san, and a couple of
cosmetic changes.

#24

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Paul Guo (#19)

Re: fdatasync performance problem with large number of DB files

On Wed, Mar 17, 2021 at 11:42 PM Paul Guo <paulguo@gmail.com> wrote:

I just quickly reviewed the patch (the code part). It looks good. Only
one concern
or question is do_syncfs() for symlink opened fd for syncfs() - I'm
not 100% sure.

Alright, let me try to prove that it works the way we want with an experiment.

I'll make a directory with a file in it, and create a symlink to it in
another filesystem:

tmunro@x1:~/junk$ mkdir my_wal_dir
tmunro@x1:~/junk$ touch my_wal_dir/foo
tmunro@x1:~/junk$ ln -s /home/tmunro/junk/my_wal_dir /dev/shm/my_wal_dir_symlink
tmunro@x1:~/junk$ ls /dev/shm/my_wal_dir_symlink/
foo

Now I'll write a program that repeatedly dirties the first block of
foo, and calls syncfs() on the containing directory that it opened
using the symlink:

tmunro@x1:~/junk$ cat test.c
#define _GNU_SOURCE

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main()
{
int symlink_fd, file_fd;

symlink_fd = open("/dev/shm/my_wal_dir_symlink", O_RDONLY);
if (symlink_fd < 0) {
perror("open1");
return EXIT_FAILURE;
}

file_fd = open("/home/tmunro/junk/my_wal_dir/foo", O_RDWR);
if (file_fd < 0) {
perror("open2");
return EXIT_FAILURE;
}

for (int i = 0; i < 4; ++i) {
if (pwrite(file_fd, "hello world", 10, 0) != 10) {
perror("pwrite");
return EXIT_FAILURE;
}
if (syncfs(symlink_fd) < 0) {
perror("syncfs");
return EXIT_FAILURE;
}
sleep(1);
}
return EXIT_SUCCESS;
}
tmunro@x1:~/junk$ cc test.c
tmunro@x1:~/junk$ ./a.out

While that's running, to prove that it does what we want it to do,
I'll first find out where foo lives on the disk:

tmunro@x1:~/junk$ /sbin/xfs_bmap my_wal_dir/foo
my_wal_dir/foo:
0: [0..7]: 242968520..242968527

Now I'll trace the writes going to block 242968520, and start the program again:

tmunro@x1:~/junk$ sudo btrace /dev/nvme0n1p2 | grep 242968520
259,0 4 93 4.157000669 724924 A W 244019144 + 8 <-
(259,2) 242968520
259,0 2 155 5.158446989 718635 A W 244019144 + 8 <-
(259,2) 242968520
259,0 7 23 6.163765728 724924 A W 244019144 + 8 <-
(259,2) 242968520
259,0 7 30 7.169112683 724924 A W 244019144 + 8 <-
(259,2) 242968520

#25

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Thomas Munro (#23)

Re: fdatasync performance problem with large number of DB files

On 2021/03/18 19:19, Thomas Munro wrote:

On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:

About the syncfs patch, my first impression on the guc name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you guys think
It is better to rename it to a clearer name like sync_method_after_crash or others?

Works for me. Here is a new version like that, also including the
documentation change discussed with Fujii-san, and a couple of
cosmetic changes.

Thanks for updating the patch!

+        database cluster that did not shut down cleanly, including copies
+        created with pg_basebackup.

"pg_basebackup" should be "<application>pg_basebackup</application>"?

+ while ((de = ReadDir(dir, "pg_tblspc")))

The comment of SyncDataDirectory() says "Errors are logged but not
considered fatal". So ReadDirExtended() with LOG level should be used
here, instead?

Isn't it better to call CHECK_FOR_INTERRUPTS() in this loop?

+ fd = open(path, O_RDONLY);

For current use, it's not harmful to use open() and close(). But isn't
it safer to use OpenTransientFile() and CloseTransientFile(), instead?
Because do_syncfs() may be used for other purpose in the future.

+	if (syncfs(fd) < 0)
+		elog(LOG, "syncfs failed for %s: %m", path);

According to the message style guide, this message should be something
like "could not sync filesystem for \"%s\": %m"?

I confirmed that no error was reported when crash recovery started with
syncfs, in old Linux. I should also confirm that no error is reported in that
case in Linux 5.8+, but I don't have that environement. So I've not tested
this feature in Linux 5.8+....

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#26

Bruce Momjian

bruce@momjian.us

over 4 years ago

In reply to: Thomas Munro (#23)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 18, 2021 at 11:19:13PM +1300, Thomas Munro wrote:

On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:

About the syncfs patch, my first impression on the guc name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you guys think
It is better to rename it to a clearer name like sync_method_after_crash or others?

Works for me. Here is a new version like that, also including the
documentation change discussed with Fujii-san, and a couple of
cosmetic changes.

Are we sure we want to use the word "crash" here? I don't remember
seeing it used anywhere else in our user interface. I guess it is
"crash recovery".

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#27

Bruce Momjian

bruce@momjian.us

over 4 years ago

In reply to: Bruce Momjian (#26)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 18, 2021 at 09:54:11AM -0400, Bruce Momjian wrote:

On Thu, Mar 18, 2021 at 11:19:13PM +1300, Thomas Munro wrote:

On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:

About the syncfs patch, my first impression on the guc name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you guys think
It is better to rename it to a clearer name like sync_method_after_crash or others?

Works for me. Here is a new version like that, also including the
documentation change discussed with Fujii-san, and a couple of
cosmetic changes.

Are we sure we want to use the word "crash" here? I don't remember
seeing it used anywhere else in our user interface. I guess it is
"crash recovery".

Maybe call it "recovery_sync_method"?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#28

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Bruce Momjian (#27)

Re: fdatasync performance problem with large number of DB files

On 2021/03/18 23:03, Bruce Momjian wrote:

Are we sure we want to use the word "crash" here? I don't remember
seeing it used anywhere else in our user interface.

We have GUC restart_after_crash.

I guess it is

"crash recovery".

Maybe call it "recovery_sync_method"

+1. This name sounds good to me. Or recovery_init_sync_method, because that
sync happens in the initial phase of recovery.

Another idea from different angle is data_directory_sync_method. If we adopt
this, we can easily extend this feature for other use cases (other than sync at
the beginning of recovery) without changing the name.
I'm not sure if such cases actually exist, though.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#29

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Fujii Masao (#28)

3 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Fri, Mar 19, 2021 at 5:50 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/18 23:03, Bruce Momjian wrote:

Are we sure we want to use the word "crash" here? I don't remember
seeing it used anywhere else in our user interface.

We have GUC restart_after_crash.

I guess it is

"crash recovery".

Maybe call it "recovery_sync_method"

+1. This name sounds good to me. Or recovery_init_sync_method, because that
sync happens in the initial phase of recovery.

Yeah, I was trying to fit the existing pattern
{restart,remove_temp_files}_after_crash. But
recovery_init_sync_method also sounds good to me. I prefer the
version with "init"... without "init", people might get the wrong idea
about what this controls, so let's try that. Done in the attached
version.

Another idea from different angle is data_directory_sync_method. If we adopt
this, we can easily extend this feature for other use cases (other than sync at
the beginning of recovery) without changing the name.
I'm not sure if such cases actually exist, though.

I can't imagine what -- it's like using a sledge hammer to crack a nut.

(I am aware of a semi-related idea: use the proposed fsinfo() Linux
system call to read the filesystem-wide error counter at every
checkpoint to see if anything bad happened that Linux forgot to tell
us about through the usual channels. That's the same internal
mechanism used by syncfs() to report errors, but last I checked it
hadn't been committed yet. I don't think that's share anything with
this code though.)

From your earlier email:

On Fri, Mar 19, 2021 at 2:12 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

+        database cluster that did not shut down cleanly, including copies
+        created with pg_basebackup.
"pg_basebackup" should be "<application>pg_basebackup</application>"?

Fixed.

+ while ((de = ReadDir(dir, "pg_tblspc")))

The comment of SyncDataDirectory() says "Errors are logged but not
considered fatal". So ReadDirExtended() with LOG level should be used
here, instead?

Fixed.

Isn't it better to call CHECK_FOR_INTERRUPTS() in this loop?

How could this be useful?

+ fd = open(path, O_RDONLY);

For current use, it's not harmful to use open() and close(). But isn't
it safer to use OpenTransientFile() and CloseTransientFile(), instead?

Ok, done, for consistency with other code.

Because do_syncfs() may be used for other purpose in the future.

I hope not :-)

+       if (syncfs(fd) < 0)
+               elog(LOG, "syncfs failed for %s: %m", path);
According to the message style guide, this message should be something
like "could not sync filesystem for \"%s\": %m"?

Fixed.

I confirmed that no error was reported when crash recovery started with
syncfs, in old Linux. I should also confirm that no error is reported in that
case in Linux 5.8+, but I don't have that environement. So I've not tested
this feature in Linux 5.8+....

I have a Linux 5.10 system. Here's a trace of SyncDataDirectory on a
system that has two tablespaces and has a symlink for pg_wal:

[pid 3861224] lstat("pg_wal", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0
[pid 3861224] openat(AT_FDCWD, ".", O_RDONLY) = 4
[pid 3861224] syncfs(4) = 0
[pid 3861224] close(4) = 0
[pid 3861224] openat(AT_FDCWD, "pg_tblspc",
O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
[pid 3861224] fstat(4, {st_mode=S_IFDIR|0700, st_size=32, ...}) = 0
[pid 3861224] getdents64(4, 0x55627e18fb60 /* 4 entries */, 32768) = 112
[pid 3861224] openat(AT_FDCWD, "pg_tblspc/16406", O_RDONLY) = 5
[pid 3861224] syncfs(5) = 0
[pid 3861224] close(5) = 0
[pid 3861224] openat(AT_FDCWD, "pg_tblspc/16407", O_RDONLY) = 5
[pid 3861224] syncfs(5) = 0
[pid 3861224] close(5) = 0
[pid 3861224] getdents64(4, 0x55627e18fb60 /* 0 entries */, 32768) = 0
[pid 3861224] close(4) = 0
[pid 3861224] openat(AT_FDCWD, "pg_wal", O_RDONLY) = 4
[pid 3861224] syncfs(4) = 0
[pid 3861224] close(4) = 0

To see how it looks when syncfs() fails, I added a fake implementation
that fails with EUCLEAN on every second call, and then the output
looks like this:

...
1616111356.663 startup 3879272 LOG: database system was interrupted;
last known up at 2021-03-19 12:48:33 NZDT
1616111356.663 startup 3879272 LOG: could not sync filesystem for
"pg_tblspc/16406": Structure needs cleaning
1616111356.663 startup 3879272 LOG: could not sync filesystem for
"pg_wal": Structure needs cleaning
1616111356.663 startup 3879272 LOG: database system was not properly
shut down; automatic recovery in progress
...

A more common setup with no tablespaces and pg_wal as a non-symlink looks like:

[pid 3861448] lstat("pg_wal", {st_mode=S_IFDIR|0700, st_size=316, ...}) = 0
[pid 3861448] openat(AT_FDCWD, ".", O_RDONLY) = 4
[pid 3861448] syncfs(4) = 0
[pid 3861448] close(4) = 0
[pid 3861448] openat(AT_FDCWD, "pg_tblspc",
O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
[pid 3861448] fstat(4, {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0
[pid 3861448] getdents64(4, 0x55764beb0b60 /* 2 entries */, 32768) = 48
[pid 3861448] getdents64(4, 0x55764beb0b60 /* 0 entries */, 32768) = 0
[pid 3861448] close(4)

The alternative fsync() mode is (unsurprisingly) much longer.

Thanks for the reviews!

PS: For illustration/discussion, I've also attached a "none" patch. I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.

#30

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Thomas Munro (#29)

3 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:

PS: For illustration/discussion, I've also attached a "none" patch. I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.

Erm... I attached the wrong version by mistake. Here's a better one.
(Note: I'm not expecting any review of the 0003 patch in this CF, I
just wanted to share the future direction I'd like to consider for
this problem.)

#31

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Thomas Munro (#30)

Re: fdatasync performance problem with large number of DB files

On 2021/03/19 10:37, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:

PS: For illustration/discussion, I've also attached a "none" patch. I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.

Erm... I attached the wrong version by mistake. Here's a better one.

Thanks for updating the patch! It looks good to me!
I have one minor comment for the patch.

+		elog(LOG, "could not open %s: %m", path);
+		return;
+	}
+	if (syncfs(fd) < 0)
+		elog(LOG, "could not sync filesystem for \"%s\": %m", path);

Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()? For example,

ereport(LOG,
(errcode_for_file_access(),
errmsg("could not open \"%s\": %m", path)))

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#32

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Fujii Masao (#31)

Re: fdatasync performance problem with large number of DB files

On 2021/03/19 11:22, Fujii Masao wrote:

On 2021/03/19 10:37, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:

PS: For illustration/discussion, I've also attached a "none" patch.Â I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.

Erm... I attached the wrong version by mistake.Â Here's a better one.

0002 patch looks good to me. Thanks!
I have minor comments.

- * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
- * all potential filesystem, depending on recovery_init_sync_method setting.
+ * Issue fsync recursively on PGDATA and all its contents, issue syncfs for
+ * all potential filesystem, or do nothing, depending on
+ * recovery_init_sync_method setting.

The comment in SyncDataDirectory() should be updated so that
it mentions "none" method, as the above?

+        This is only safe if all buffered data is known to have been flushed
+        to disk already, for example by a tool such as
+        <application>pg_basebackup</application>.  It is not a good idea to

Isn't it better to add something like "without <literal>--no-sync</literal>"
to "pg_basebackup" part? Which would prevent users from misunderstanding
that pg_basebackup always ensures that whatever options are specified.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#33

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: Fujii Masao (#32)

3 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Thanks for updating the patch! It looks good to me!
I have one minor comment for the patch.
+               elog(LOG, "could not open %s: %m", path);
+               return;
+       }
+       if (syncfs(fd) < 0)
+               elog(LOG, "could not sync filesystem for \"%s\": %m", path);
Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()? For example,

Fixed.

I'll let this sit until tomorrow to collect any other feedback or
objections, and then push the 0001 patch
(recovery_init_sync_method=syncfs).

On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

0002 patch looks good to me. Thanks!
I have minor comments.

Ok, I made the changes you suggested. Let's see if anyone else would
like to vote for or against the concept of the 0002 patch
(recovery_init_sync_method=none).

#34

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Thomas Munro (#33)

Re: fdatasync performance problem with large number of DB files

On 2021/03/19 14:28, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for updating the patch! It looks good to me!
I have one minor comment for the patch.
+               elog(LOG, "could not open %s: %m", path);
+               return;
+       }
+       if (syncfs(fd) < 0)
+               elog(LOG, "could not sync filesystem for \"%s\": %m", path);
Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()? For example,
Fixed.

Thanks! LGTM.

I'll let this sit until tomorrow to collect any other feedback or
objections, and then push the 0001 patch
(recovery_init_sync_method=syncfs).

Understood.

On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

0002 patch looks good to me. Thanks!
I have minor comments.

Ok, I made the changes you suggested.

Thanks! LGTM.

Let's see if anyone else would
like to vote for or against the concept of the 0002 patch
(recovery_init_sync_method=none).

Agreed. I also want to hear more opinion about the setting "none".

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#35

Justin Pryzby

pryzby@telsasoft.com

over 4 years ago

In reply to: Thomas Munro (#33)

Re: fdatasync performance problem with large number of DB files

On Fri, Mar 19, 2021 at 06:28:46PM +1300, Thomas Munro wrote:

+++ b/doc/src/sgml/config.sgml

+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.  This

Maybe it should say "data, tablespace, and WAL directories", or just "critical
directories".

+	{
+		{"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+		},

"and tablespaces and WAL"

--
Justin

#36

David Steele

david@pgmasters.net

over 4 years ago

In reply to: Thomas Munro (#33)

Re: fdatasync performance problem with large number of DB files

On 3/19/21 1:28 AM, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for updating the patch! It looks good to me!
I have one minor comment for the patch.
+               elog(LOG, "could not open %s: %m", path);
+               return;
+       }
+       if (syncfs(fd) < 0)
+               elog(LOG, "could not sync filesystem for \"%s\": %m", path);
Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()? For example,
Fixed.

I'll let this sit until tomorrow to collect any other feedback or
objections, and then push the 0001 patch
(recovery_init_sync_method=syncfs).

I had a look at the patch and it looks good to me.

Should we mention in the docs that the contents of non-standard symlinks
will not be synced, i.e. anything other than tablespaces and pg_wal? Or
can we point them to some docs saying not to do that (if such exists)?

On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

0002 patch looks good to me. Thanks!
I have minor comments.

Ok, I made the changes you suggested. Let's see if anyone else would
like to vote for or against the concept of the 0002 patch
(recovery_init_sync_method=none).

It worries me that this needs to be explicitly "turned off" after the
initial recovery. Seems like something of a foot gun.

Since we have not offered this functionality before I'm not sure we
should rush to introduce it now. For backup solutions that do their own
syncing, syncfs() should provide excellent performance so long as the
file system is not shared, which is something the user can control (and
is noted in the docs).

Regards,
--
-David
david@pgmasters.net

#37

Thomas Munro

thomas.munro@gmail.com

over 4 years ago

In reply to: David Steele (#36)

Re: fdatasync performance problem with large number of DB files

Thanks Justin and David. Replies to two emails inline:

On Sat, Mar 20, 2021 at 12:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Fri, Mar 19, 2021 at 06:28:46PM +1300, Thomas Munro wrote:
+++ b/doc/src/sgml/config.sgml
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.  This
Maybe it should say "data, tablespace, and WAL directories", or just "critical
directories".

Fair point. Here's what I went with:

When set to <literal>fsync</literal>, which is the default,
<productname>PostgreSQL</productname> will recursively open and
synchronize all files in the data directory before crash
recovery
begins. The search for files will follow symbolic links for the WAL
directory and each configured tablespace (but not any other symbolic
links).

+     {
+             {"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+                     gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+             },

"and tablespaces and WAL"

I feel like that's getting too detailed for the GUC description?

On Sat, Mar 20, 2021 at 2:55 AM David Steele <david@pgmasters.net> wrote:

I had a look at the patch and it looks good to me.

Thanks!

Should we mention in the docs that the contents of non-standard symlinks
will not be synced, i.e. anything other than tablespaces and pg_wal? Or
can we point them to some docs saying not to do that (if such exists)?

Good idea. See above for the adjustment I went with to describe the
traditional behaviour, and then I also made a similar change to the
syncfs part, which, I hope, manages to convey that the new mode
matches the existing policy on symlinks:

On Linux, <literal>syncfs</literal> may be used instead, to ask the
operating system to synchronize the whole file systems that contain the
data directory, the WAL file and each tablespace (but not any other
file systems that may be reachable through symbolic links).

I thought about adding some text along the lines that such symlinks
are not expected, but I think you're right that what we really need is
a good place to point to. I mean, generally you can't mess around
with the files managed by PostgreSQL and expect everything to keep
working correctly, but it wouldn't hurt to make an explicit statement
about symlinks and where they're allowed (or maybe there is one
already and I failed to find it). There are hints though, like
pg_basebackup's documentation which tells you it won't follow or
preserve them in general, but... hmm, it also contemplates various
special subdirectories (pg_dynshmem, pg_notify, pg_replslot, ...) that
might be symlinks without saying why.

Ok, I made the changes you suggested. Let's see if anyone else would
like to vote for or against the concept of the 0002 patch
(recovery_init_sync_method=none).

It worries me that this needs to be explicitly "turned off" after the
initial recovery. Seems like something of a foot gun.

Since we have not offered this functionality before I'm not sure we
should rush to introduce it now. For backup solutions that do their own
syncing, syncfs() should provide excellent performance so long as the
file system is not shared, which is something the user can control (and
is noted in the docs).

Thanks. I'm leaving the 0002 patch "on ice" until someone can explain
how you're supposed to use it without putting a hole in your foot.

I pushed the 0001 patch. Thanks to all who reviewed. Of course,
further documentation improvement patches are always welcome.

(One silly thing I noticed is that our comments generally think
"filesystem" is one word, but our documentation always has a space;
this patch followed the local convention in both cases!)

#38

David Steele

david@pgmasters.net

over 4 years ago

In reply to: Thomas Munro (#37)

Re: fdatasync performance problem with large number of DB files

On 3/19/21 7:16 PM, Thomas Munro wrote:

Thanks Justin and David. Replies to two emails inline:

Fair point. Here's what I went with:

When set to <literal>fsync</literal>, which is the default,
<productname>PostgreSQL</productname> will recursively open and
synchronize all files in the data directory before crash
recovery
begins. The search for files will follow symbolic links for the WAL
directory and each configured tablespace (but not any other symbolic
links).

I thought about adding some text along the lines that such symlinks
are not expected, but I think you're right that what we really need is
a good place to point to. I mean, generally you can't mess around
with the files managed by PostgreSQL and expect everything to keep
working correctly

WRT to symlinks I'm not sure that's fair to say. From PG's perspective
it's just a dir/file after all. Other than pg_wal I have seen
pg_stat/pg_stat_tmp sometimes symlinked, plus config files, and the log dir.

pgBackRest takes a pretty liberal approach here. Were preserve all
dir/file symlinks no matter where they appear and allow all of them to
be remapped on restore.

but it wouldn't hurt to make an explicit statement
about symlinks and where they're allowed (or maybe there is one
already and I failed to find it).

I couldn't find it either and I would be in favor of it. For instance,
pgBackRest forbids tablespaces inside PGDATA and when people complain
(more often then you might imagine) we can just point to the code/docs.

There are hints though, like
pg_basebackup's documentation which tells you it won't follow or
preserve them in general, but... hmm, it also contemplates various
special subdirectories (pg_dynshmem, pg_notify, pg_replslot, ...) that
might be symlinks without saying why.

Right, pg_dynshmem is another one that I've seen symlinked. Some things
are nice to have on fast storage. pg_notify and pg_replslot are similar
since they get written to a lot in certain configurations.

It worries me that this needs to be explicitly "turned off" after the
initial recovery. Seems like something of a foot gun.

Since we have not offered this functionality before I'm not sure we
should rush to introduce it now. For backup solutions that do their own
syncing, syncfs() should provide excellent performance so long as the
file system is not shared, which is something the user can control (and
is noted in the docs).

Thanks. I'm leaving the 0002 patch "on ice" until someone can explain
how you're supposed to use it without putting a hole in your foot.

(One silly thing I noticed is that our comments generally think
"filesystem" is one word, but our documentation always has a space;
this patch followed the local convention in both cases!)

Personally I prefer "file system".

Regards,
--
-David
david@pgmasters.net

#39

Greg Stark

stark@mit.edu

over 4 years ago

In reply to: Tom Lane (#8)

Re: fdatasync performance problem with large number of DB files

On Wed, 10 Mar 2021 at 20:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.

Isn't that true for fsync and everything else related on
less-than-bleeding-edge kernels anyways?

--
greg

#40

Justin Pryzby

pryzby@telsasoft.com

over 4 years ago

In reply to: Thomas Munro (#37)

Re: fdatasync performance problem with large number of DB files

On Sat, Mar 20, 2021 at 12:16:27PM +1300, Thomas Munro wrote:

+     {
+             {"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+                     gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+             },

Is there any reason why this can't be PGC_SIGHUP ?
(Same as restart_after_crash, remove_temp_files_after_crash)

As neat as it'd be, I am not expecting the recovery process to reload the
configuration and finish fast if I send it HUP.

While I'm looking, it's not clear why this needs to be PGC_POSTMASTER.
data_sync_retry - but see b3a156858

This one isn't documented as requiring a restart:
max_logical_replication_workers.

ignore_invalid_pages could probably be SIGHUP, but it's intended to be used as
a commandline option, not in a config file.

--
Justin

#41

Michael Paquier

michael@paquier.xyz

over 4 years ago

In reply to: Justin Pryzby (#40)

Re: fdatasync performance problem with large number of DB files

On Tue, May 25, 2021 at 07:13:59PM -0500, Justin Pryzby wrote:

This one isn't documented as requiring a restart:
max_logical_replication_workers.

There is much more than meets the eye here, and this is unrelated to
this thread, so let's discuss that on a separate thread. I'll start a
new one with everything I found.
--
Michael

#42

Justin Pryzby

pryzby@telsasoft.com

over 4 years ago

In reply to: Justin Pryzby (#40)

Re: fdatasync performance problem with large number of DB files

On Tue, May 25, 2021 at 07:13:59PM -0500, Justin Pryzby wrote:

On Sat, Mar 20, 2021 at 12:16:27PM +1300, Thomas Munro wrote:
+     {
+             {"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+                     gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+             },
Is there any reason why this can't be PGC_SIGHUP ?
(Same as restart_after_crash, remove_temp_files_after_crash)

I can't see any reason why this is nontrivial.
What about data_sync_retry?

commit 2d2d2e10f99548c486b50a1ce095437d558e8649
Author: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sat May 29 13:41:14 2021 -0500

Change recovery_init_sync_method to PGC_SIGHUP..

The setting has no effect except during startup.
But it's nice to be able to change the setting dynamically, which is expected
to be pretty useful to an admin following crash recovery when turning the
service off and on again is not so appealing.

Participants

Thread Outline

fdatasync performance problem with large number of DB files

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: