Large files for relations
Big PostgreSQL databases use and regularly open/close huge numbers of
file descriptors and directory entries for various anachronistic
reasons, one of which is the 1GB RELSEG_SIZE thing. The segment
management code is trickier than you might think and also still
harbours known bugs.
A nearby analysis of yet another obscure segment life cycle bug
reminded me of this patch set to switch to simple large files and
eventually drop all that. I originally meant to develop the attached
sketch-quality code further and try proposing it in the 16 cycle,
while I was down the modernisation rabbit hole[1], but then I got
sidetracked: at some point I believed that the 56 bit relfilenode thing
might be necessary for correctness, but then I found a set of rules
that seem to hold up without that. I figured I might as well post
what I have early in the 17 cycle as a "concept" patch to see which
way the flames blow.
There are various boring details due to Windows, and then a load of
fairly obvious changes, and then a whole can of worms about how we'd
handle the transition for the world's fleet of existing databases.
I'll cut straight to that part. Different choices on aggressiveness
could be made, but here are the straw-man answers I came up with so
far:
1. All new relations would be in large format only. No 16384.N
files, just 16384 that can grow to MaxBlockNumber * BLCKSZ.
2. The existence of a file 16384.1 means that this smgr relation is
in legacy segmented format that came from pg_upgrade (note that we
don't unlink that file once it exists, even when truncating the fork,
until we eventually drop the relation).
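Expressed as a sketch (Python for brevity; fork_is_segmented() and the
file names are purely illustrative, not the patch's actual code):

```python
import os

def fork_is_segmented(path: str) -> bool:
    # Proposed rule: a fork is in legacy segmented format if and only
    # if its first extra segment "<path>.1" exists -- which is why
    # that file would never be unlinked on truncation, only when the
    # relation is eventually dropped.
    return os.path.exists(path + ".1")

os.makedirs("demo_db", exist_ok=True)
open("demo_db/16384", "w").close()          # new large-format relation
open("demo_db/16385", "w").close()          # pg_upgrade'd, segmented...
open("demo_db/16385.1", "w").close()        # ...as revealed by this file
print(fork_is_segmented("demo_db/16384"))   # False
print(fork_is_segmented("demo_db/16385"))   # True
```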
3. Forks that were pg_upgrade'd from earlier releases using hard
links or reflinks would implicitly be in large format if they only had
one segment, and otherwise they could stay in the traditional format
for a grace period of N major releases, after which we'd plan to drop
segment support. pg_upgrade's [ref]link mode would therefore be the
only way to get a segmented relation, other than a developer-only
trick for testing/debugging.
4. Every opportunity to convert a multi-segment fork to large format
would be taken: pg_upgrade in copy mode, basebackup, COPY DATABASE,
VACUUM FULL, TRUNCATE, etc. You can see approximately working sketch
versions of all the cases I thought of so far in the attached.
5. The main places that do file-level copying of relations would use
copy_file_range() to do the splicing, so that on file systems that are
smart enough (XFS, ZFS, BTRFS, ...) with qualifying source and
destination, the operation can be very fast, and other degrees of
optimisation are available to the kernel too even for file systems
without block sharing magic (pushing down block range copies to
hardware/network storage, etc). The copy_file_range() stuff could
also be proposed independently (I vaguely recall it was discussed a
few times before), it's just that it really comes into its own when
you start splicing files together, as needed here, and it's also been
adopted by FreeBSD with the same interface as Linux and has an
efficient implementation in bleeding edge ZFS there.
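For what it's worth, the splicing idea in miniature (a hedged Python
sketch, not the patch's C code; copy_file_range() is exposed as
os.copy_file_range on Linux with Python 3.8+, with a plain read/write
fallback elsewhere):

```python
import os

def concat_segments(base: str) -> None:
    # Splice legacy segments base.1, base.2, ... onto the end of base,
    # preferring copy_file_range() so that capable file systems
    # (XFS, Btrfs, ZFS) can share blocks or push the copy down.
    dst = os.open(base, os.O_WRONLY)
    os.lseek(dst, 0, os.SEEK_END)
    n = 1
    while os.path.exists(f"{base}.{n}"):
        src = os.open(f"{base}.{n}", os.O_RDONLY)
        remaining = os.fstat(src).st_size
        while remaining > 0:
            try:
                # Uses and advances both file offsets.
                done = os.copy_file_range(src, dst, remaining)
            except (AttributeError, OSError):
                # Fallback: plain copy from the current offsets.
                done = os.write(dst, os.read(src, min(remaining, 1 << 20)))
            remaining -= done
        os.close(src)
        os.unlink(f"{base}.{n}")
        n += 1
    os.close(dst)

# Tiny demo with two 4-byte "segments".
with open("demo_rel", "wb") as f:
    f.write(b"x" * 4)
with open("demo_rel.1", "wb") as f:
    f.write(b"y" * 4)
concat_segments("demo_rel")
print(open("demo_rel", "rb").read())  # b'xxxxyyyy'
```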
Stepping back, the main ideas are: (1) for some users of large
databases, the conversion would happen painlessly at upgrade time
without them even really noticing, using modern file system facilities
for speed; (2) for anyone who wants to defer that because of lack of
fast copy_file_range() and a desire to avoid prolonged downtime by
using links or reflinks, concatenation can be put off for the next N
releases, giving a total of 5 + N years of option to defer the work,
and in that case there are also many ways to proactively change to
large format before the time comes with varying degrees of granularity
and disruption. For example, set up a new replica and fail over, or
VACUUM FULL tables one at a time, etc.
There are plenty of things left to do in this patch set: pg_rewind
doesn't understand optional segmentation yet, there are probably more
things like that, and I expect there are some ssize_t vs pgoff_t
confusions I missed that could bite a 32 bit system. But you can see
the basics working on a typical system.
I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front? I think the main collateral damage would be odd old external
tools, like the ancient version of Windows tar I occasionally see
mentioned, that sort of thing, but that'd just be another case of
"well don't use that then", I guess? What else might we need to think
about, outside PostgreSQL?
What other problems might occur inside PostgreSQL? Clearly we'd need
to figure out a decent strategy to automate testing of all of the
relevant transitions. We could test the splicing code paths with an
optional test suite that you might enable along with a small segment
size (as we're already testing on CI and probably BF after the last
round of segmentation bugs). To test the messy Windows off_t API
stuff convincingly, we'd need actual > 4GB files, I think? Maybe
doable cheaply with file system hole punching tricks.
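A sketch of that trick in Python (illustrative only): on any file
system with sparse-file support, the nominally 5GB file below occupies
almost no disk, so a > 4GB test fixture is essentially free.

```python
import os

FIVE_GB = 5 * 1024**3
path = "big_sparse_test"
with open(path, "wb") as f:
    f.seek(FIVE_GB - 1)   # leave a (5GB - 1 byte) hole...
    f.write(b"\0")        # ...and materialise only the last block

st = os.stat(path)
print(st.st_size)          # 5368709120 logical bytes
print(st.st_blocks * 512)  # actual allocation: tiny on sparse-capable FS
os.unlink(path)
```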
Speaking of file system holes, this patch set doesn't touch buffile.c.
That code wants to use segments for two extra purposes: (1) parallel
create index merges workers' output using segmentation tricks as if
there were holes in the file; this could perhaps be replaced with
large files that make use of actual OS-level holes, but I didn't feel
like additionally claiming that all computers have sparse files --
perhaps another approach is needed anyway; (2) buffile.c deliberately
spreads large buffiles around across multiple temporary tablespaces
using segments supposedly for space management reasons. So although
it initially looks like a nice safe little place to start using large
files, we'd need an answer to those design choices first.
/me dons flameproof suit and goes back to working on LLVM problems for a while
[1]: https://wiki.postgresql.org/wiki/AllComputers
[2]: https://en.wikipedia.org/wiki/Comparison_of_file_systems
Attachments:
0001-Assert-that-pgoff_t-is-wide-enough.patch
0002-Use-pgoff_t-in-system-call-replacements-on-Windows.patch
0003-Support-large-files-on-Windows-in-our-VFD-API.patch
0004-Use-pgoff_t-instead-of-off_t-in-more-places.patch
0005-Use-large-files-for-relation-storage.patch
0006-Detect-copy_file_range-function.patch
0007-Use-copy_file_range-to-implement-copy_file.patch
0008-Teach-copy_file-to-concatenate-segmented-files.patch
0009-Use-copy_file_range-in-pg_upgrade.patch
0010-Teach-pg_upgrade-to-concatenate-segmented-files.patch
0011-Teach-basebackup-to-concatenate-segmented-files.patch
Hi
I like this patch - it can save some system resources - I am not sure
how much, because bigger tables usually use partitioning.
Important note - this feature breaks sharing files on the backup side - so
before disabling 1GB sized files, this issue should be solved.
Regards
Pavel
On Tue, May 2, 2023 at 3:28 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
I like this patch - it can save some system resources - I am not sure how much, because bigger tables usually use partitioning.
Yeah, if you only use partitions of < 1GB it won't make a difference.
Larger partitions are not uncommon, though.
Important note - this feature breaks sharing files on the backup side - so before disabling 1GB sized files, this issue should be solved.
Hmm, right, so there is a backup granularity continuum with "whole
database cluster" at one end, "only files whose size, mtime [or
optionally also checksum] changed since last backup" in the middle,
and "only blocks that changed since LSN of last backup" at the other
end. Getting closer to the right end of that continuum can make
backups require less reading, less network transfer, less writing
and/or less storage space depending on details. But this proposal
moves the middle thing further to the left by changing the granularity
from 1GB to whole relation, which can be gargantuan with this patch.
Ultimately we need to be all the way at the right on that continuum,
and there are clearly several people working on that goal.
I'm not involved in any of those projects, but it's fun to think about
an alien technology that produces complete standalone backups like
rsync --link-dest (as opposed to "full" backups followed by a chain of
"incremental" backups that depend on it so you need to retain them
carefully) while still sharing disk blocks with older backups, and
doing so with block granularity. TL;DW something something WAL
something something copy_file_range().
On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
rsync --link-dest
I wonder if rsync will grow a mode that can use copy_file_range() to
share blocks with a reference file (= previous backup). Something
like --copy-range-dest. That'd work for large-file relations
(assuming a file system that has block sharing, like XFS and ZFS).
You wouldn't get the "mtime is enough, I don't even need to read the
bytes" optimisation, which I assume makes all database hackers feel a
bit queasy anyway, but you'd get the space savings via the usual
rolling checksum or a cheaper version that only looks for strong
checksum matches at the same offset, or whatever other tricks rsync
might have up its sleeve.
On Wed, May 3, 2023 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
rsync --link-dest
I wonder if rsync will grow a mode that can use copy_file_range() to
share blocks with a reference file (= previous backup). Something
like --copy-range-dest. That'd work for large-file relations
(assuming a file system that has block sharing, like XFS and ZFS).
You wouldn't get the "mtime is enough, I don't even need to read the
bytes" optimisation, which I assume makes all database hackers feel a
bit queasy anyway, but you'd get the space savings via the usual
rolling checksum or a cheaper version that only looks for strong
checksum matches at the same offset, or whatever other tricks rsync
might have up its sleeve.
I understand the need to reduce open file handles, despite the
possibilities enabled by using large numbers of small files.
Snowflake, for instance, sees everything in 1MB chunks, which makes
massively parallel sequential scans (Snowflake's _only_ query plan)
possible, though I don't know if they accomplish that via separate files,
or via segments within a large file.
I am curious whether a move like this to create a generational change
in file format shouldn't be more ambitious, perhaps altering the block
format to insert a block format version number, whether that be at
every block, or every megabyte, or some other interval, and whether we
store it in-file or in a separate file to accompany the first
non-segmented file. Having
such versioning information would allow blocks of different formats to
co-exist in the same table, which could be critical to future changes such
as 64 bit XIDs, etc.
Greetings,
* Corey Huinker (corey.huinker@gmail.com) wrote:
On Wed, May 3, 2023 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
rsync --link-dest
... rsync isn't really a safe tool to use for PG backups by itself
unless you're using it with archiving and with start/stop backup and
with checksums enabled.
I wonder if rsync will grow a mode that can use copy_file_range() to
share blocks with a reference file (= previous backup). Something
like --copy-range-dest. That'd work for large-file relations
(assuming a file system that has block sharing, like XFS and ZFS).
You wouldn't get the "mtime is enough, I don't even need to read the
bytes" optimisation, which I assume makes all database hackers feel a
bit queasy anyway, but you'd get the space savings via the usual
rolling checksum or a cheaper version that only looks for strong
checksum matches at the same offset, or whatever other tricks rsync
might have up its sleeve.
There's also really good reasons to have multiple full backups and not
just a single full backup and then lots and lots of incrementals which
basically boils down to "are you really sure that one copy of that one
really important file won't every disappear from your backup
repository..?"
That said, pgbackrest does now have block-level incremental backups
(where we define our own block size ...) and there's reasons we decided
against going down the LSN-based approach (not the least of which is
that the LSN isn't always updated...), but long story short, moving to
larger than 1G files should be something that pgbackrest will be able
to handle without as much impact as there would have been previously in
terms of incremental backups. There is a loss in the ability to use
mtime to scan just the parts of the relation that changed and that's
unfortunate but I wouldn't see it as really a game changer (and yes,
there's certainly an argument for not trusting mtime, though I don't
think we've yet had a report where there was an mtime issue that our
mtime-validity checking didn't catch and force pgbackrest into
checksum-based revalidation automatically which resulted in an invalid
backup... of course, not enough people test their backups...).
I understand the need to reduce open file handles, despite the
possibilities enabled by using large numbers of small files.
I'm also generally in favor of reducing the number of open file handles
that we have to deal with. Addressing the concerns raised nearby about
weird corner-cases, such as a non-1GB-length ABCDEF.1 file existing
while ABCDEF.2 and later files also exist, is certainly another good
argument in favor of getting rid of segments.
I am curious whether a move like this to create a generational change
in file format shouldn't be more ambitious, perhaps altering the block
format to insert a block format version number, whether that be at
every block, or every megabyte, or some other interval, and whether we
store it in-file or in a separate file to accompany the first
non-segmented file. Having
such versioning information would allow blocks of different formats to
co-exist in the same table, which could be critical to future changes such
as 64 bit XIDs, etc.
To the extent you're interested in this, there are patches posted which
are already trying to move us in a direction that would allow for
different page formats that add in space for other features such as
64bit XIDs, better checksums, and TDE tags to be supported.
https://commitfest.postgresql.org/43/3986/
Currently those patches are expecting it to be declared at initdb time,
but the way they're currently written that's more of a soft requirement
as you can tell on a per-page basis what features are enabled for that
page. Might make sense to support it in that form first anyway though,
before going down the more ambitious route of allowing different pages
to have different sets of features enabled for them concurrently.
When it comes to 'a separate file', we do have forks already and those
serve a very valuable but distinct use-case where you can get
information from the much smaller fork (be it the FSM or the VM or some
future thing) while something like 64bit XIDs or a stronger checksum is
something you'd really need on every page. I have serious doubts about
a proposal where we'd store information needed on every page read in
some far-away block in the same file, such as one block every 1MB, as
that would turn every block access into two.
Thanks,
Stephen
On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front?
There is some trouble in the ambiguity of what we mean by "modern" and
"large files". There are still a large number of users of ext4 where the
max file size is 16TB. Switching to a single large file per relation would
effectively cut the max table size in half for those users. How would a
user with say a 20TB table running on ext4 be impacted by this change?
On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:
On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease. Anyone know of anything to worry about on that
front?
There is some trouble in the ambiguity of what we mean by "modern" and
"large files". There are still a large number of users of ext4 where
the max file size is 16TB. Switching to a single large file per
relation would effectively cut the max table size in half for those
users. How would a user with say a 20TB table running on ext4 be
impacted by this change?
Hrmph. Yeah, that might be a bit of a problem. I see it discussed in
various places that MySQL/InnoDB can't have tables bigger than 16TB on
ext4 because of this, when it's in its default one-file-per-object
mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
mode like DB2, Oracle etc, in which case I think you can have multiple
16TB segment files and get past that ext4 limit). It's frustrating
because 16TB is still really, really big and you probably should be
using partitions, or more partitions, to avoid all kinds of other
scalability problems at that size. But however hypothetical the
scenario might be, it should work, and this is certainly a plausible
argument against the "aggressive" plan described above with the hard
cut-off where we get to drop the segmented mode.
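To put numbers on that (back-of-envelope arithmetic, assuming the
default 8KB BLCKSZ and ext4's 16TiB-per-file limit with 4KiB
file-system blocks):

```python
BLCKSZ = 8192                  # default PostgreSQL block size
MAX_BLOCK_NUMBER = 0xFFFFFFFE  # highest addressable block number

max_table_bytes = (MAX_BLOCK_NUMBER + 1) * BLCKSZ  # just under 32 TiB
ext4_max_file_bytes = 16 * 2**40                   # 16 TiB

# A maximum-size table cannot fit in a single ext4 file.
print(max_table_bytes > ext4_max_file_bytes)  # True
print(max_table_bytes / 2**40)                # ~32 (TiB)
```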
Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
concatenate with the above patches, so you'd have to use link or
reflink mode (you'd probably want to use that anyway, given the
sheer volume of data to copy otherwise, since ext4 is also not capable
of block-range sharing), but then you'd be out of luck after N future
major releases, according to that plan where we start deleting the
code, so you'd need to organise some smaller partitions before that
time comes. Or pg_upgrade to a target on xfs etc. I wonder if a
future version of extN will increase its max file size.
A less aggressive version of the plan would be that we just keep the
segment code for the foreseeable future with no planned cut off, and
we make all of those "piggy back" transformations that I showed in the
patch set optional. For example, I had it so that CLUSTER would
quietly convert your relation to large format, if it was still in
segmented format (might as well if you're writing all the data out
anyway, right?), but perhaps that could depend on a GUC. Likewise for
base backup. Etc. Then someone concerned about hitting the 16TB
limit on ext4 could opt out. Or something like that. It seems funny
though, that's exactly the user who should want this feature (they
have 16,000 relation segment files).
Thomas Munro <thomas.munro@gmail.com> writes:
[…]
A less aggressive version of the plan would be that we just keep the
segment code for the foreseeable future with no planned cut off, and
we make all of those "piggy back" transformations that I showed in the
patch set optional. For example, I had it so that CLUSTER would
quietly convert your relation to large format, if it was still in
segmented format (might as well if you're writing all the data out
anyway, right?), but perhaps that could depend on a GUC. Likewise for
base backup. Etc. Then someone concerned about hitting the 16TB
limit on ext4 could opt out. Or something like that. It seems funny
though, that's exactly the user who should want this feature (they
have 16,000 relation segment files).
If we're going to have to keep the segment code for the foreseeable
future anyway, could we not get most of the benefit by increasing the
segment size to something like 1TB? The vast majority of tables would
fit in one file, and there would be less risk of hitting filesystem
limits.
- ilmari
On Thu, May 11, 2023 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
[…] But however hypothetical the
scenario might be, it should work,
Agreed, it is frustrating, but it is not hypothetical. I have seen a
number of users having single tables larger than 16TB who don't use
partitioning because of the limitations we have today. The most common
reason is needing multiple unique constraints on the table that don't
include the partition key, something like a user_id and an email.
There are workarounds for those cases, but usually it's easier to deal
with a single large table than to deal with the sharp edges those
workarounds introduce.
Greetings,
* Dagfinn Ilmari Mannsåker (ilmari@ilmari.org) wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
[…]
If we're going to have to keep the segment code for the foreseeable
future anyway, could we not get most of the benefit by increasing the
segment size to something like 1TB? The vast majority of tables would
fit in one file, and there would be less risk of hitting filesystem
limits.
While I tend to agree that 1GB is too small, 1TB seems like it's
possibly going to end up on the too big side of things, or at least,
if we aren't getting rid of the segment code then it's possibly throwing
away the benefits we have from the smaller segments without really
giving us all that much. Going from 1G to 10G would reduce the number
of open file descriptors by quite a lot without having much of a net
change on other things. 50G or 100G would reduce the FD handles further
but starts to make us lose out a bit more on some of the nice parts of
having multiple segments.
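The file-count arithmetic behind those options is just ceiling
division (a 20TiB table here is an arbitrary example):

```python
def segment_count(table_bytes: int, seg_bytes: int) -> int:
    # Number of files needed for one fork: ceiling division.
    return -(-table_bytes // seg_bytes)

table = 20 * 2**40  # a 20 TiB table
for seg_gib in (1, 10, 100, 1024):
    print(seg_gib, segment_count(table, seg_gib * 2**30))
# 1GiB -> 20480 files, 10GiB -> 2048, 100GiB -> 205, 1TiB -> 20
```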
Just some thoughts.
Thanks,
Stephen
Repeating what was mentioned on Twitter, because I had some experience with
the topic. With fewer files per table there will be more contention on the
per-inode mutex (which might now be the per-inode rwsem). I haven't read
filesystem source in a long time. Back in the day, and perhaps today, it
was locked for the duration of a write to storage (locked within the
kernel) and was briefly locked while setting up a read.
The workaround for writes was one of:
1) enable disk write cache or use battery-backed HW RAID to make writes
faster (yes disks, I encountered this prior to 2010)
2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't
locked for the duration of a write
I have a vague memory that filesystems have improved in this regard.
--
Mark Callaghan
mdcallag@gmail.com
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
With fewer files per table there will be more contention on the
per-inode mutex (which might now be the per-inode rwsem). [...]
2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem)
wasn't locked for the duration of a write. I have a vague memory
that filesystems have improved in this regard.
(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)
Right, 80s file systems like UFS (and I suspect ext and ext2, which
were probably based on similar ideas and ran on non-SMP machines?)
used coarse grained locking including vnodes/inodes level. Then over
time various OSes and file systems have improved concurrency. Brief
digression, as someone who got started on IRIX in the 90s and still
thinks those were probably the coolest computers: at SGI, first they
replaced SysV UFS with EFS (E for extent-based allocation) and
invented O_DIRECT to skip the buffer pool, and then blew the doors off
everything with XFS, which maximised I/O concurrency and possibly (I
guess, it's not open source so who knows?) involved a revamped VFS to
lower stuff like inode locks, motivated by monster IRIX boxes with up
to 1024 CPUs and huge storage arrays. In the Linux ext3 era, I
remember hearing lots of reports of various kinds of large systems
going faster just by switching to XFS and there is lots of writing
about that. ext4 certainly changed enormously. One reason back in
those days (mid 2000s?) was the old
fsync-actually-fsyncs-everything-in-the-known-universe-and-not-just-your-file
thing, and another was the lack of write concurrency especially for
direct I/O, and probably lots more things. But that's all ancient
history...
As for ext4, we've detected and debugged clues about the gradual
weakening of locking over time on this list: we know that concurrent
read/write to the same page of a file was previously atomic, but when
we switched to pread/pwrite for most data (ie not making use of the
current file position), it ceased to be (a concurrent reader can see a
mash-up of old and new data with visible cache line-ish stripes in it,
so there isn't even a write-lock for the page); then we noticed that
in later kernels even read/write ceased to be atomic (implicating a
change in file size/file position interlocking, I guess). I also
vaguely recall reading on here a long time ago that lseek()
performance was dramatically improved with weaker inode interlocking,
perhaps even in response to this very program's pathological SEEK_END
call frequency (something I hope to fix, but I digress). So I think
it's possible that the effect you mentioned is gone?
I can think of a few differences compared to those other RDBMSs.
There the discussion was about one-file-per-relation vs
one-big-file-for-everything, whereas we're talking about
one-file-per-relation vs many-files-per-relation (which doesn't change
the point much, just making clear that I'm not proposing a 42PB file
to hold everything, so you can still partition to get different
files). We also usually call fsync in series in our checkpointer
(after first getting the writebacks started with sync_file_range()
some time sooner). Currently our code believes that it is not safe to
call fdatasync() for files whose size might have changed. There is no
basis for that in POSIX or in any system that I currently know of
(though I haven't looked into it seriously), but I believe there was a
historical file system that at some point in history interpreted
"non-essential meta data" (the stuff POSIX allows it not to flush to
disk) to include "the size of the file" (whereas POSIX really just
meant that you don't have to synchronise the mtime and similar), which
is probably why PostgreSQL has some code that calls fsync() on newly
created empty WAL segments to "make sure the indirect blocks are down
on disk" before allowing itself to use only fdatasync() later to
overwrite it with data. The point being that, for the most important
kind of interactive/user facing I/O latency, namely WAL flushes, we
already use fdatasync(). It's possible that we could use it to flush
relation data too (ie the relation files in question here, usually
synchronised by the checkpointer) according to POSIX but it doesn't
immediately seem like something that should be at all hot and it's
background work. But perhaps I lack imagination.
Thanks, thought-provoking stuff.
On Sat, May 13, 2023 at 11:01 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
use XFS and O_DIRECT
As for direct I/O, we're only just getting started on that. We
currently can't produce more than one concurrent WAL write, and then
for relation data, we just got very basic direct I/O support but we
haven't yet got the asynchronous machinery to drive it properly (work
in progress, more soon). I was just now trying to find out what the
state of parallel direct writes is in ext4, and it looks like it's
finally happening:
On Fri, May 12, 2023 at 4:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
Repeating what was mentioned on Twitter, because I had some experience
with the topic. With fewer files per table there will be more contention on
the per-inode mutex (which might now be the per-inode rwsem). I haven't
read filesystem source in a long time. Back in the day, and perhaps today,
it was locked for the duration of a write to storage (locked within the
kernel) and was briefly locked while setting up a read.
The workaround for writes was one of:
1) enable disk write cache or use battery-backed HW RAID to make writes faster (yes disks, I encountered this prior to 2010)
2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't
locked for the duration of a write
I have a vague memory that filesystems have improved in this regard.
(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)
Yes, although when the decision was made it was probably ext-3 -> XFS. We
suffered from "fsync a file == fsync the filesystem"
because MySQL binlogs use buffered IO and are appended on write. Switching
from ext-? to XFS was an easy perf win
so I don't have much experience with ext-? over the past decade.
Right, 80s file systems like UFS (and I suspect ext and ext2, which
Late 80s is when I last hacked on Unix filesystem code, excluding browsing XFS
and ext source. Unix was easy back then -- one big kernel lock covers
everything.
some time sooner). Currently our code believes that it is not safe to
call fdatasync() for files whose size might have changed. There is no
Long ago we added code for InnoDB to avoid fsync/fdatasync in some cases
when O_DIRECT was used. While great for performance
we also forgot to make sure they were still done when files were extended.
Eventually we fixed that.
Thanks for all of the details.
--
Mark Callaghan
mdcallag@gmail.com
On Fri, May 12, 2023 at 9:53 AM Stephen Frost <sfrost@snowman.net> wrote:
While I tend to agree that 1GB is too small, 1TB seems like it's
possibly going to end up on the too big side of things, or at least,
if we aren't getting rid of the segment code then it's possibly throwing
away the benefits we have from the smaller segments without really
giving us all that much. Going from 1G to 10G would reduce the number
of open file descriptors by quite a lot without having much of a net
change on other things. 50G or 100G would reduce the FD handles further
but starts to make us lose out a bit more on some of the nice parts of
having multiple segments.
This is my view as well, more or less. I don't really like our current
handling of relation segments; we know it has bugs, and making it
non-buggy feels difficult. And there are performance issues as well --
file descriptor consumption, for sure, but also probably that crossing
a file boundary likely breaks the operating system's ability to do
readahead to some degree. However, I think we're going to find that
moving to a system where we have just one file per relation fork and
that file can be arbitrarily large is not fantastic, either. Jim's
point about running into filesystem limits is a good one (hi Jim, long
time no see!) and the problem he points out with ext4 is almost
certainly not the only one. It doesn't just have to be filesystems,
either. It could be a limitation of an archiving tool (tar, zip, cpio)
or a file copy utility or whatever as well. A quick Google search
suggests that most such things have been updated to use 64-bit sizes,
but my point is that the set of things that can potentially cause
problems is broader than just the filesystem. Furthermore, even when
there's no hard limit at play, a smaller file size can occasionally be
*convenient*, as in Pavel's example of using hard links to share
storage between backups. From that point of view, a 16GB or 64GB or
256GB file size limit seems more convenient than no limit and more
convenient than a large limit like 1TB.
However, the bugs are the flies in the ointment (ahem). If we just
make the segment size bigger but don't get rid of segments altogether,
then we still have to fix the bugs that can occur when you do have
multiple segments. I think part of Thomas's motivation is to dodge
that whole category of problems. If we gradually deprecate
multi-segment mode in favor of single-file-per-relation-fork, then the
fact that the segment handling code has bugs becomes progressively
less relevant. While that does make some sense, I'm not sure I really
agree with the approach. The problem is that we're trading problems
that we at least theoretically can fix somehow by hitting our code
with a big enough hammer for an unknown set of problems that stem from
limitations of software we don't control, maybe don't even know about.
--
Robert Haas
EDB: http://www.enterprisedb.com
Thanks all for the feedback. It was a nice idea and it *almost*
works, but it seems like we just can't drop segmented mode. And the
automatic transition schemes I showed don't make much sense without
that goal.
What I'm hearing is that something simple like this might be more acceptable:
* initdb --rel-segsize (cf --wal-segsize), default unchanged
* pg_upgrade would convert if source and target don't match
I would probably also leave out those Windows file API changes, too.
--rel-segsize would simply refuse larger sizes until someone does the
work on that platform, to keep the initial proposal small.
I would probably leave the experimental copy_on_write() ideas out too,
for separate discussion in a separate proposal.
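As a sketch of how that simpler proposal might look on the command line, assuming the hypothetical --rel-segsize option followed the existing --wal-segsize convention (neither the flag nor the pg_upgrade conversion exists yet; paths and sizes are illustrative):

```shell
# Hypothetical usage sketch: --rel-segsize does not exist yet.
initdb --rel-segsize=16GB -D /path/to/newcluster

# pg_upgrade would convert relation files if the source and target
# clusters were initialized with different segment sizes.
pg_upgrade -d /path/to/oldcluster -D /path/to/newcluster --copy
```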
On 24.05.23 02:34, Thomas Munro wrote:
Thanks all for the feedback. It was a nice idea and it *almost*
works, but it seems like we just can't drop segmented mode. And the
automatic transition schemes I showed don't make much sense without
that goal.
What I'm hearing is that something simple like this might be more acceptable:
* initdb --rel-segsize (cf --wal-segsize), default unchanged
makes sense
* pg_upgrade would convert if source and target don't match
This would be good, but it could also be an optional or later feature.
Maybe that should be a different mode, like
--copy-and-adjust-as-necessary, so that users would have to opt into
what would presumably be slower than plain --copy, rather than being
surprised by it, if they unwittingly used incompatible initdb options.
I would probably also leave out those Windows file API changes, too.
--rel-segsize would simply refuse larger sizes until someone does the
work on that platform, to keep the initial proposal small.
Those changes from off_t to pgoff_t? Yes, it would be good to do
without those. Apart from the practical problems that have been brought
up, this was a major annoyance with the proposed patch set IMO.
I would probably leave the experimental copy_on_write() ideas out too,
for separate discussion in a separate proposal.
right
On Wed, May 24, 2023 at 2:18 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
What I'm hearing is that something simple like this might be more acceptable:
* initdb --rel-segsize (cf --wal-segsize), default unchanged
makes sense
+1.
* pg_upgrade would convert if source and target don't match
This would be good, but it could also be an optional or later feature.
+1. I think that would be nice to have, but not absolutely required.
IMHO it's best not to overcomplicate these projects. Not everything
needs to be part of the initial commit. If the initial commit happens
2 months from now and then stuff like this gets added over the next 8,
that's strictly better than trying to land the whole patch set next
March.
--
Robert Haas
EDB: http://www.enterprisedb.com
Greetings,
* Peter Eisentraut (peter.eisentraut@enterprisedb.com) wrote:
On 24.05.23 02:34, Thomas Munro wrote:
Thanks all for the feedback. It was a nice idea and it *almost*
works, but it seems like we just can't drop segmented mode. And the
automatic transition schemes I showed don't make much sense without
that goal.
What I'm hearing is that something simple like this might be more acceptable:
* initdb --rel-segsize (cf --wal-segsize), default unchanged
makes sense
Agreed, this seems alright in general. Having more initdb-time options
to help with certain use-cases rather than having things be compile-time
is definitely just generally speaking a good direction to be going in,
imv.
* pg_upgrade would convert if source and target don't match
This would be good, but it could also be an optional or later feature.
Agreed.
Maybe that should be a different mode, like --copy-and-adjust-as-necessary,
so that users would have to opt into what would presumably be slower than
plain --copy, rather than being surprised by it, if they unwittingly used
incompatible initdb options.
I'm curious as to why it would be slower than a regular copy..?
I would probably also leave out those Windows file API changes, too.
--rel-segsize would simply refuse larger sizes until someone does the
work on that platform, to keep the initial proposal small.
Those changes from off_t to pgoff_t? Yes, it would be good to do without
those. Apart from the practical problems that have been brought up, this was
a major annoyance with the proposed patch set IMO.
I would probably leave the experimental copy_on_write() ideas out too,
for separate discussion in a separate proposal.
right
You mean copy_file_range() here, right?
Shouldn't we just add support for that today into pg_upgrade,
independently of this? Seems like a worthwhile improvement even without
the benefit it would provide to changing segment sizes.
Thanks,
Stephen