fdatasync performance problem with large number of DB files

Started by Michael Brown about 5 years ago, 3 messages, general
#1 Michael Brown
michael.brown@discourse.org

We've encountered a production performance problem with pg13 related to
how it fsyncs the whole data directory in certain scenarios, related to
what Paul (bcc'ed) described in a post to pgsql-hackers [1].

Background:

We've observed the full recursive fsync is triggered when

* pg_basebackup receives a streaming backup (via fsync_dir_recurse [2]
or fsync_pgdata) unless --no-sync is specified
* postgres starts up unclean (via SyncDataDirectory [3])

We run multiple postgres clusters and some of those clusters have many
(~450) databases (one database-per-customer) meaning that the postgres
data directory has around 700,000 files.

On one of our less loaded servers this takes ~7 minutes to complete, but
on another [4] this takes ~90 minutes.

Obviously this is an untenable risk. We've modified our process that
bootstraps a replica via pg_basebackup to instead do "pg_basebackup
--no-sync…" followed by a "sync", but we don't have any way to do the
equivalent for the postgres startup.

I presume the reason postgres doesn't blindly run a sync() is that we
don't know what other I/O is on the system and it'd be rude to affect
other services. That makes sense, except for our environment the work
done by the recursive fsync is orders of magnitude more disruptive than
a sync().

My questions are:

* is there a knob missing we can configure?
* can we get a knob to use a single sync() call instead of a recursive
fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

Thanks!

[1]: /messages/by-id/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw@mail.gmail.com
[2]: https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181
[3]: https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495
[4]: It should be identical config-wise. It isn't starved for IO but does have other regular write workloads

--
Michael Brown
Civilized Discourse Construction Kit, Inc.
https://www.discourse.org/

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Brown (#1)
Re: fdatasync performance problem with large number of DB files

Michael Brown <michael.brown@discourse.org> writes:

I presume the reason postgres doesn't blindly run a sync() is that we
don't know what other I/O is on the system and it'd be rude to affect
other services. That makes sense, except for our environment the work
done by the recursive fsync is orders of magnitude more disruptive than
a sync().

Hmm.

* is there a knob missing we can configure?

No. The trouble with sync() is that per POSIX, it only schedules the
writes; there's no way to tell when the work has been done. I see
that Linux offers stronger promises in this department, but I don't
think that's very portable. Moreover, even on Linux there's no
way to detect whether any of the writes failed.

Barring some solution to those problems, we would be unlikely to take
a patch that uses sync() instead of fsync().

regards, tom lane
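
The difference in error reporting described above is visible in the
function signatures themselves: POSIX declares sync() as returning void,
while fsync() returns an int and sets errno. A minimal sketch follows;
flush_everything_unchecked and flush_file_checked are hypothetical
wrapper names used only for illustration.

```c
#include <errno.h>
#include <unistd.h>

/* sync() returns void: the caller cannot learn whether any write failed. */
void flush_everything_unchecked(void)
{
    sync();
}

/* fsync() reports failure through its return value and errno. */
int flush_file_checked(int fd)
{
    if (fsync(fd) != 0)
        return errno;   /* e.g. EIO, or EBADF for an invalid descriptor */
    return 0;
}
```

A caller of flush_file_checked can react to a non-zero result; a caller
of flush_everything_unchecked has nothing to inspect.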

#3 Michael Brown
michael.brown@discourse.org
In reply to: Tom Lane (#2)
Re: fdatasync performance problem with large number of DB files

On 2021-02-22 5:43 p.m., Tom Lane wrote:

Michael Brown <michael.brown@discourse.org> writes:

* is there a knob missing we can configure?

No. The trouble with sync() is that per POSIX, it only schedules the
writes; there's no way to tell when the work has been done. I see
that Linux offers stronger promises in this department, but I don't
think that's very portable.

True, but as mentioned below we're looking for a "this makes sense for
our environment" switch.

Moreover, even on Linux there's no way to detect whether any of the writes failed.

Ugh. Presumably those would be noticed when the WAL replays? (I'll admit
I'd have to look at the sequence of events and think about it, I don't
know offhand.)

Oh, syncfs() exists but is Linux-specific, again, darn.
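
For reference, a minimal sketch of calling syncfs() on Linux. It takes
an open file descriptor and flushes the filesystem that contains it, and
unlike sync() it returns a status. How much that status actually covers
depends on the kernel version, so treat it as a hint rather than a
guarantee. syncfs_path is a hypothetical helper name, and the sketch
assumes glibc, where _GNU_SOURCE is needed to get the declaration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush the filesystem that contains path.
 * Returns 0 on success or an errno value on failure.
 * (Hypothetical helper; Linux-specific.)
 */
int syncfs_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;
    int err = (syncfs(fd) != 0) ? errno : 0;
    close(fd);
    return err;
}
```

A data directory spread over several filesystems (tablespaces, a
separate WAL volume) would need one call per filesystem.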

Barring some solution to those problems, we would be unlikely to take
a patch that uses sync() instead of fsync().

I wouldn't dare to propose outright switching to sync() for everyone,
but a knob we can turn on to say "use sync (or syncfs()) instead" is
what we need, discounting a better solution.
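
To make the request concrete, here is a rough sketch of how such a knob
might be parsed. Everything in it is hypothetical: the DataSyncMethod
type, the parse_data_sync_method function, and the "fsync" and "syncfs"
values are illustrative names, not an existing PostgreSQL setting. The
default keeps the current recursive-fsync behaviour, so nobody is
switched over without opting in.

```c
#include <string.h>

/* Hypothetical setting values; not an existing PostgreSQL option. */
typedef enum {
    DATA_SYNC_FSYNC,    /* recursive fsync() of every file (current behaviour) */
    DATA_SYNC_SYNCFS,   /* one syncfs() per filesystem (Linux only) */
    DATA_SYNC_INVALID   /* unrecognised value */
} DataSyncMethod;

/* Map a configuration string to a method; NULL means "use the default". */
DataSyncMethod parse_data_sync_method(const char *value)
{
    if (value == NULL)
        return DATA_SYNC_FSYNC;
    if (strcmp(value, "fsync") == 0)
        return DATA_SYNC_FSYNC;
    if (strcmp(value, "syncfs") == 0)
        return DATA_SYNC_SYNCFS;
    return DATA_SYNC_INVALID;
}
```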

--
Michael Brown
Civilized Discourse Construction Kit, Inc.
https://www.discourse.org/