FileFallocate misbehaving on XFS
Hello PG Hackers
Our application has recently migrated to PG16, and we have experienced
some failed upgrades. The upgrades are performed using pg_upgrade and
have failed during the phase where the schema is restored into the new
cluster, with the following error:
pg_restore: error: could not execute query: ERROR: could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device
HINT: Check free disk space.
This has happened multiple times on different servers, and in each
case there was plenty of free space available.
We found this thread describing similar issues:
/messages/by-id/AS1PR05MB91059AC8B525910A5FCD6E699F9A2@AS1PR05MB9105.eurprd05.prod.outlook.com
As is the case in that thread, all of the affected databases are using XFS.
One of my colleagues built postgres from source with
HAVE_POSIX_FALLOCATE not defined, and using that build he was able to
complete the pg_upgrade, and then switched to a stock postgres build
after the upgrade. However, as you might expect, after the upgrade we
have experienced similar errors during regular operation. We make
heavy use of COPY, which is mentioned in the above discussion as
pre-allocating files.
We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
Linux 9 (Kernel 5.14.0).
I am wondering if this bug might be related:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323
When given an offset of 0 and a length, fallocate (man 2 fallocate) reports ENOSPC if the size of the file + the length to be allocated is greater than the available space.
There is a reproduction procedure at the bottom of the above ubuntu
thread, and using that procedure I get the same results on both kernel
4.18.0 and 5.14.0.
When calling fallocate with offset zero on an existing file, I get
ENOSPC even if I am only requesting the same amount of space as the
file already occupies.
If I repeat the experiment with ext4 I don't get that behaviour.
On a surface examination of the code paths leading to the
FileFallocate call, it does not look like it should be trying to
allocate already allocated space, but I might have missed something
there.
Is this already being looked into?
Thanks in advance,
Cheers
Mike
On Mon, 9 Dec 2024 at 10:19, Michael Harris <harmic@gmail.com> wrote:
Is this already being looked into?
Funny, I guess it's the same reason I've randomly seen the WhatsApp web
interface on Chrome complain since I switched to XFS. It says something
like "no more space on disk" and logs out, with more than 300GB available.
Anyway, just a stupid hint: I would try writing to the XFS mailing list.
There you can reach the XFS maintainers from Red Hat and the usual
historical developers, of course!
On 12/9/24 08:34, Michael Harris wrote:
pg_restore: error: could not execute query: ERROR: could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device
HINT: Check free disk space.

This has happened multiple times on different servers, and in each
case there was plenty of free space available.

[...]

Is this already being looked into?
Sounds more like an XFS bug/behavior, so it's not clear to me what we
could do about it. I mean, if the filesystem reports bogus out-of-space,
is there even something we can do?
What is not clear to me is why this would affect pg_upgrade at all. We
have the data files split into 1GB segments, and the copy/clone/... goes
one by one. So there shouldn't be more than 1GB of "extra" space needed.
Surely you have more free space on the system?
regards
--
Tomas Vondra
On 12/9/24 10:47, Andrea Gelmini wrote:
Funny, I guess it's the same reason I've randomly seen the WhatsApp web
interface on Chrome complain since I switched to XFS. It says something
like "no more space on disk" and logs out, with more than 300GB available.
If I understand the fallocate issue correctly, it essentially ignores
the offset, so "fallocate -o 0 -l LENGTH" fails if
LENGTH + CURRENT_LENGTH > FREE_SPACE
But if you have 300GB available, that'd mean you have a file that's
close to that size already. But is that likely for WhatsApp?
Anyway, just a stupid hint: I would try writing to the XFS mailing list.
Yes, I think that's a better place to report this. I don't think we're
doing anything particularly weird / wrong with fallocate().
regards
--
Tomas Vondra
On Mon, Dec 9, 2024 at 10:19 AM Michael Harris <harmic@gmail.com> wrote:
Hi Michael,
We found this thread describing similar issues:
/messages/by-id/AS1PR05MB91059AC8B525910A5FCD6E699F9A2@AS1PR05MB9105.eurprd05.prod.outlook.com
We've had some cases in the past here at EDB where an OS vendor blamed
XFS AG fragmentation (too many AGs, and if one AG does not have enough
space -> error). Could you perhaps show us the output of the following
on that LUN:
1. xfs_info
2. the script from https://www.suse.com/support/kb/doc/?id=000018219,
run for your AG range
-J.
On 12/9/24 11:27, Jakub Wartak wrote:
We've had some cases in the past here at EDB where an OS vendor blamed
XFS AG fragmentation (too many AGs, and if one AG does not have enough
space -> error). Could you perhaps show us the output of xfs_info and
the script from https://www.suse.com/support/kb/doc/?id=000018219 on
that LUN?
But this can be reproduced on a brand-new filesystem - I just tried
creating a 1GB image, creating XFS on it, mounting it, and fallocating a
600MB file twice. The second fallocate fails, yet there can't be any
real fragmentation.
regards
--
Tomas Vondra
Hi,
On 2024-12-09 15:47:55 +0100, Tomas Vondra wrote:
But this can be reproduced on a brand-new filesystem - I just tried
creating a 1GB image, creating XFS on it, mounting it, and fallocating a
600MB file twice. The second fallocate fails, yet there can't be any
real fragmentation.
If I understand correctly xfs, before even looking at the file's current
layout, checks if there's enough free space for the fallocate() to
succeed. Here's an explanation for why:
https://www.spinics.net/lists/linux-xfs/msg55429.html
The real problem with preallocation failing part way through due to
overcommit of space is that we can't go back and undo the
allocation(s) made by fallocate because when we get ENOSPC we have
lost all the state of the previous allocations made. If fallocate is
filling holes between unwritten extents already in the file, then we
have no way of knowing where the holes we filled were and hence
cannot reliably free the space we've allocated before ENOSPC was
hit.
I.e. reserving space as you go would leave you open to ending up with some,
but not all, of those allocations having been made. Whereas pre-reserving the
worst case space needed, ahead of time, ensures that you have enough space to
go through it all.
You can't just go through the file [range] and compute how much free space you
will need to allocate and then do a second pass through the file, because the
file layout might have changed concurrently...
This issue seems independent of the issue Michael is having though. Postgres,
afaik, won't fallocate huge ranges with already allocated space.
Greetings,
Andres Freund
Hi,
On 2024-12-09 18:34:22 +1100, Michael Harris wrote:
Our application has recently migrated to PG16, and we have experienced
some failed upgrades. The upgrades are performed using pg_upgrade and
have failed during the phase where the schema is restored into the new
cluster, with the following error:

pg_restore: error: could not execute query: ERROR: could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device
HINT: Check free disk space.
Were those pg_upgrades done with pg_upgrade --clone? Or have been, on the same
filesystem, in the past?
The reflink stuff in xfs (which is used to implement copy-on-write for files)
is somewhat newer and you're using somewhat old kernels:
We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
Linux 9 (Kernel 5.14.0).
I found some references for bugs that were fixed in 5.13. But I think at least
some of this would persist if the filesystem ran into the issue with a kernel
before those fixes. Did you upgrade "in-place" from Rocky Linux 8?
I am wondering if this bug might be related:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323
Doubt it, we never do this as far as I am aware.
Greetings,
Andres Freund
Hi Andres
On Tue, 10 Dec 2024 at 03:31, Andres Freund <andres@anarazel.de> wrote:
Were those pg_upgrades done with pg_upgrade --clone? Or have been, on the same
filesystem, in the past?
No, our procedure is to use --link.
I found some references for bugs that were fixed in 5.13. But I think at least
some of this would persist if the filesystem ran into the issue with a kernel
before those fixes. Did you upgrade "in-place" from Rocky Linux 8?
We generally don't use "in place" OS upgrades - however we would
usually have the databases on separate filesystem(s) to the OS, and
those filesystem(s) would be preserved through the upgrade, while the
root fs would be scratched.
A lot of the cases reported are on RL8. I will try to find out the
history of the RL9 cases to see if the filesystems started on RL8.
Could you please provide me links for the kernel bugs you are referring to?
Cheers
Mike.
Hi Tomas
On Mon, 9 Dec 2024 at 21:06, Tomas Vondra <tomas@vondra.me> wrote:
Sounds more like an XFS bug/behavior, so it's not clear to me what we
could do about it. I mean, if the filesystem reports bogus out-of-space,
is there even something we can do?
I don't disagree that it's most likely an XFS issue. However, XFS is
pretty widely used - it's the default FS for RHEL & the default in
SUSE for non-root partitions - so maybe some action should be taken.
Some things we could consider:
- Providing a way to configure PG not to use posix_fallocate at runtime
- Detecting the use of XFS (probably nasty and complex to do in a
platform-independent way) and disabling posix_fallocate
- In the case of posix_fallocate failing with ENOSPC, falling back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)
- Documenting that XFS might not be a good choice, at least for some
kernel versions
What is not clear to me is why would this affect pg_upgrade at all. We
have the data files split into 1GB segments, and the copy/clone/... goes
one by one. So there shouldn't be more than 1GB "extra" space needed.
Surely you have more free space on the system?
Yes, that also confused me. It actually fails during the schema
restore phase - where pg_upgrade calls pg_restore to restore a
schema-only dump that it takes earlier in the process. At this stage
it is only trying to restore the schema, not any actual table data.
Note that we use the --link option to pg_upgrade, so it should not be
using much space even when the table data is being upgraded.
The filesystems have >1TB free space when this has occurred.
It does continue to give this error after the upgrade, at apparently
random intervals, when data is being loaded into the DB using COPY
commands, so it might be best not to focus too much on the fact that
we first encounter it during the upgrade.
Cheers
Mike.
Hi,
On 2024-12-10 09:34:08 +1100, Michael Harris wrote:
On Tue, 10 Dec 2024 at 03:31, Andres Freund <andres@anarazel.de> wrote:
I found some references for bugs that were fixed in 5.13. But I think at least
some of this would persist if the filesystem ran into the issue with a kernel
before those fixes. Did you upgrade "in-place" from Rocky Linux 8?

We generally don't use "in place" OS upgrades - however we would
usually have the databases on separate filesystem(s) to the OS, and
those filesystem(s) would be preserved through the upgrade, while the
root fs would be scratched.
Makes sense.
A lot of the cases reported are on RL8. I will try to find out the
history of the RL9 cases to see if the filesystems started on RL8.
That'd be helpful....
Could you please provide me links for the kernel bugs you are referring to?
I unfortunately closed most of the tabs, the only one I could quickly find
again is the one referenced at the bottom of:
https://www.spinics.net/lists/linux-xfs/msg55445.html
Greetings,
Andres
Hi,
On 2024-12-10 10:00:43 +1100, Michael Harris wrote:
On Mon, 9 Dec 2024 at 21:06, Tomas Vondra <tomas@vondra.me> wrote:
Sounds more like an XFS bug/behavior, so it's not clear to me what we
could do about it. I mean, if the filesystem reports bogus out-of-space,
is there even something we can do?

I don't disagree that it's most likely an XFS issue. However, XFS is
pretty widely used - it's the default FS for RHEL & the default in
SUSE for non-root partitions - so maybe some action should be taken.

Some things we could consider:
- Providing a way to configure PG not to use posix_fallocate at runtime
- Detecting the use of XFS (probably nasty and complex to do in a
platform-independent way) and disabling posix_fallocate
- In the case of posix_fallocate failing with ENOSPC, falling back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)
- Documenting that XFS might not be a good choice, at least for some
kernel versions
Pretty unexcited about all of these - XFS is fairly widely used for PG, but
this problem doesn't seem very common. It seems to me that we're missing
something that causes this to only happen in a small subset of cases.
I think the source of this needs to be debugged further before we try to apply
workarounds in postgres.
Are you using any filesystem quotas?
It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
xfs_spaceman -c 'freesp -s' /mountpoint
xfs_spaceman -c 'health' /mountpoint
and df.
What kind of storage is this on?
Was the filesystem ever grown from a smaller size?
Have you checked the filesystem's internal consistency? I.e. something like
xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
unmounted though. But corrupted filesystem datastructures certainly could
cause spurious ENOSPC.
What is not clear to me is why this would affect pg_upgrade at all. We
have the data files split into 1GB segments, and the copy/clone/... goes
one by one. So there shouldn't be more than 1GB "extra" space needed.
Surely you have more free space on the system?

Yes, that also confused me. It actually fails during the schema
restore phase - where pg_upgrade calls pg_restore to restore a
schema-only dump that it takes earlier in the process. At this stage
it is only trying to restore the schema, not any actual table data.
Note that we use the --link option to pg_upgrade, so it should not be
using much space even when the table data is being upgraded.
Are you using pg_upgrade -j?
I'm asking because looking at linux's git tree I found this interesting recent
commit: https://git.kernel.org/linus/94a0333b9212 - but IIUC it'd actually
cause file creation, not fallocate to fail.
The filesystems have >1TB free space when this has occurred.
It does continue to give this error after the upgrade, at apparently
random intervals, when data is being loaded into the DB using COPY
commands, so it might be best not to focus too much on the fact that
we first encounter it during the upgrade.
I assume the file that actually errors out changes over time? It's always
fallocate() that fails?
Can you tell us anything about the workload / data? Lots of tiny tables, lots
of big tables, write heavy, ...?
Greetings,
Andres Freund
Hi Andres
Following up on the earlier question about OS upgrade paths - all the
cases reported so far are either on RL8 (Kernel 4.18.0) or were
upgraded to RL9 (kernel 5.14.0) and the affected filesystems were
preserved.
In fact the RL9 systems were initially built as Centos 7, and then
when that went EOL they were upgraded to RL9. The process was as I
described - the /var/opt filesystem which contained the database was
preserved, and the root and other OS filesystems were scratched.
The majority of systems where we have this problem are on RL8.
On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:
Are you using any filesystem quotas?
No.
It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
xfs_spaceman -c 'freesp -s' /mountpoint
xfs_spaceman -c 'health' /mountpoint
and df.
I gathered this info from one of the systems that is currently on RL9.
This system is relatively small compared to some of the others that
have exhibited this issue, but it is the only one I can access right
now.
# uname -a
Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0 nrext64=0
data = bsize=4096 blocks=1049885696, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=512639, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
from to extents blocks pct
1 1 37502 37502 0.15
2 3 62647 148377 0.59
4 7 87793 465950 1.85
8 15 135529 1527172 6.08
16 31 184811 3937459 15.67
32 63 165979 7330339 29.16
64 127 101674 8705691 34.64
128 255 15123 2674030 10.64
256 511 973 307655 1.22
total free extents 792031
total free blocks 25134175
average free extent size 31.7338
from to extents blocks pct
1 1 43895 43895 0.22
2 3 59312 141693 0.70
4 7 83406 443964 2.20
8 15 120804 1362108 6.75
16 31 133140 2824317 14.00
32 63 118619 5188474 25.71
64 127 77960 6751764 33.46
128 255 16383 2876626 14.26
256 511 1763 546506 2.71
total free extents 655282
total free blocks 20179347
average free extent size 30.7949
from to extents blocks pct
1 1 72034 72034 0.26
2 3 98158 232135 0.83
4 7 126228 666187 2.38
8 15 169602 1893007 6.77
16 31 180286 3818527 13.65
32 63 164529 7276833 26.01
64 127 109687 9505160 33.97
128 255 22113 3921162 14.02
256 511 1901 592052 2.12
total free extents 944538
total free blocks 27977097
average free extent size 29.6199
from to extents blocks pct
1 1 51462 51462 0.21
2 3 98993 233204 0.93
4 7 131578 697655 2.79
8 15 178151 1993062 7.97
16 31 175718 3680535 14.72
32 63 145310 6372468 25.48
64 127 89518 7749021 30.99
128 255 18926 3415768 13.66
256 511 2640 813586 3.25
total free extents 892296
total free blocks 25006761
average free extent size 28.0252
# xfs_spaceman -c 'health' /var/opt
Health status has not been collected for this filesystem.
What kind of storage is this on?
As mentioned, there are quite a few systems in different sites, so a
number of different storage solutions in use, some with directly
attached disks, others with some SAN solutions.
The instance I got the printout above from is a VM, but in the other
site they are all bare metal.
Was the filesystem ever grown from a smaller size?
I can't say for sure that none of them were, but given the number of
different systems that have this issue I am confident that would not
be a common factor.
Have you checked the filesystem's internal consistency? I.e. something like
xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
unmounted though. But corrupted filesystem datastructures certainly could
cause spurious ENOSPC.
I executed this on the same system as the above prints came from. It
did not report any issues.
Are you using pg_upgrade -j?
Yes, we use -j `nproc`
I assume the file that actually errors out changes over time? It's always
fallocate() that fails?
Yes, correct, on both counts.
Can you tell us anything about the workload / data? Lots of tiny tables, lots
of big tables, write heavy, ...?
It is a write heavy application which stores mostly time series data.
The time series data is partitioned by time; the application writes
constantly into the 'current' partition, and data is expired by
removing the oldest partition. Most of the data is written once and
not updated.
There are quite a lot of these partitioned tables (in the 1000's or
10000's) depending on how the application is configured. Individual
partitions range in size from a few MB to 10s of GB.
Cheers
Mike.
Hi again
One extra piece of information: I had said that all the machines were
Rocky Linux 8 or Rocky Linux 9, but actually a large number of them
are RHEL8.
Sorry for the confusion.
Of course RL8 is a rebuild of RHEL8 so it is not surprising they would
be behaving similarly.
Cheers
Mike
On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic@gmail.com> wrote:
Hi Michael,
1. Well, it doesn't look like XFS AG fragmentation to me (we had a
customer with a huge number of AGs with little space in them, reporting
such errors after upgrading to 16 but not on earlier versions - somehow
posix_fallocate() had to be the culprit).
2.
# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0 nrext64=0
Yay, reflink=0, that's pretty old fs ?!
ERROR: could not extend file
"pg_tblspc/16401/PG_16_202307071/17643/1249.1" with FileFallocate(): No
space left on device

This indicates it was allocating 1GB for such a table (".1"), on a
tablespace that was created more than a year ago. Could you maybe get us
the output of the commands below too? (or from any other directory
exhibiting such errors)

stat pg_tblspc/16401/PG_16_202307071/17643/
ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l
time ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l # to assess
the timing of the getdents() call, as that may indirectly tell us
something about that directory
3. Maybe somehow there is a bigger interaction between posix_fallocate()
and XFS's delayed dynamic speculative preallocation when many processes
are all writing into different partitions? Maybe try the "allocsize=1m"
mount option for that fs and see if that helps. I'm going to speculate
about XFS's speculative :) preallocations, but if we have an fd cache
and are *not* closing fds, how would XFS know to abort its own
speculation about a streaming write? (multiply that by the number of
open fds to get an avalanche of "preallocations").
4. You can also try compiling with the patch from Alvaro in [2]
("0001-Add-some-debugging-around-mdzeroextend.patch"), so we might end
up having more clarity on the offsets involved. Failing that, you could
use 'strace -e fallocate -p <pid>' to capture the exact syscall.
5. Another idea could be catching the kernel-side stack trace of
fallocate() when it hits ENOSPC, e.g. with an XFS fs and the attached
bpftrace eBPF tracer. With it I could get to the source of the problem
in my artificial reproducer, e.g.

# bpftrace ./track_enospc2.bt # wait for "START", then start reproducing
in the second session, but try to minimize the time period - eBPF might
make things really slow
$ dd if=/dev/zero of=/fs/test1 bs=1M count=200
$ fallocate /fs/test -l 30000000
fallocate: fallocate failed: No space left on device
$ df -h /fs
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 236M 217M 20M 92% /fs
# in bpftrace CTRL+C, will get:
@errors[-28, kretprobe:xfs_file_fallocate,
xfs_alloc_file_space+665
xfs_alloc_file_space+665
xfs_file_fallocate+869
vfs_fallocate+319
__x64_sys_fallocate+68
do_syscall_64+130
entry_SYSCALL_64_after_hwframe+118
]: 1
-28 = ENOSPC; xfs_alloc_file_space() was the root-cause routine, and the
stack shows the full logic behind it. The ABI might be different on your
side due to kernel variations. The tracer could be enhanced, and it
might print too much (so you need to look for that -28 in the output).
If you get any sensible output from it, you could also involve OS
support (because if posix_fallocate() fails while there is space, then
it's pretty odd anyway).
-J.
[1]: /messages/by-id/50A117B6.5030300@optionshouse.com
[2]: /messages/by-id/202409110955.6njbwzm4ocus@alvherre.pgsql
Hi,
On 2024-12-10 17:28:21 +1100, Michael Harris wrote:
On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:
It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
xfs_spaceman -c 'freesp -s' /mountpoint
xfs_spaceman -c 'health' /mountpoint
and df.
I gathered this info from one of the systems that is currently on RL9.
This system is relatively small compared to some of the others that
have exhibited this issue, but it is the only one I can access right
now.
I think it's implied, but I just want to be sure: This was one of the affected
systems?
# uname -a
Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0 nrext64=0
data = bsize=4096 blocks=1049885696, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=512639, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
It might be interesting that finobt=0, sparse=0 and nrext64=0. Those all
affect space allocation to some degree, and more recently created filesystems
will have different values for them, which could explain why you, but not
many others, hit this issue.
Any chance to get df output? I'm mainly curious about the number of used
inodes.
Could you show the mount options that end up being used?
grep /var/opt /proc/mounts
I rather doubt it is, but it'd sure be interesting if inode32 were used.
I assume you have never set XFS options for the PG directory or files within
it? Could you show
xfs_io -r -c lsattr -c stat -c statfs /path/to/directory/with/enospc
?
# for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
from to extents blocks pct
1 1 37502 37502 0.15
2 3 62647 148377 0.59
4 7 87793 465950 1.85
8 15 135529 1527172 6.08
16 31 184811 3937459 15.67
32 63 165979 7330339 29.16
64 127 101674 8705691 34.64
128 255 15123 2674030 10.64
256 511 973 307655 1.22
total free extents 792031
total free blocks 25134175
average free extent size 31.7338
from to extents blocks pct
1 1 43895 43895 0.22
2 3 59312 141693 0.70
4 7 83406 443964 2.20
8 15 120804 1362108 6.75
16 31 133140 2824317 14.00
32 63 118619 5188474 25.71
64 127 77960 6751764 33.46
128 255 16383 2876626 14.26
256 511 1763 546506 2.71
total free extents 655282
total free blocks 20179347
average free extent size 30.7949
from to extents blocks pct
1 1 72034 72034 0.26
2 3 98158 232135 0.83
4 7 126228 666187 2.38
8 15 169602 1893007 6.77
16 31 180286 3818527 13.65
32 63 164529 7276833 26.01
64 127 109687 9505160 33.97
128 255 22113 3921162 14.02
256 511 1901 592052 2.12
total free extents 944538
total free blocks 27977097
average free extent size 29.6199
from to extents blocks pct
1 1 51462 51462 0.21
2 3 98993 233204 0.93
4 7 131578 697655 2.79
8 15 178151 1993062 7.97
16 31 175718 3680535 14.72
32 63 145310 6372468 25.48
64 127 89518 7749021 30.99
128 255 18926 3415768 13.66
256 511 2640 813586 3.25
total free extents 892296
total free blocks 25006761
average free extent size 28.0252
So there's *some*, but not a lot, of imbalance in AG usage. Of course that's
as of this moment, and as you say below, you expire old partitions on a
regular basis...
My understanding of XFS's space allocation is that by default it continues to
use the same AG for allocations within one directory, until that AG is full.
For a write heavy postgres workload that's of course not optimal, as all
activity will focus on one AG.
I'd try monitoring the per-AG free space over time and see if the ENOSPC
issue is correlated with one AG getting full. 'freesp' is probably too
expensive for that, but it looks like
xfs_db -r -c agresv /dev/nvme6n1
should work?
Actually that output might be interesting to see, even when you don't hit the
issue.
Can you tell us anything about the workload / data? Lots of tiny tables, lots
of big tables, write heavy, ...?
It is a write-heavy application which stores mostly time series data.
The time series data is partitioned by time; the application writes
constantly into the 'current' partition, and data is expired by
removing the oldest partition. Most of the data is written once and
not updated.
There are quite a lot of these partitioned tables (in the 1000s or
10000s) depending on how the application is configured. Individual
partitions range in size from a few MB to 10s of GB.
So there are 1000s of tables that are concurrently being appended, but only
into one partition each. That does make it plausible that there's a
significant amount of fragmentation. Possibly transient due to the expiration.
How many partitions are there for each of the tables? Mainly wondering because
of the number of inodes being used.
Are all of the active tables within one database? That could be relevant due
to per-directory behaviour of free space allocation.
Greetings,
Andres Freund
Hi,
On 2024-12-10 12:36:33 +0100, Jakub Wartak wrote:
On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic@gmail.com> wrote:
1. Well, it doesn't look like XFS AG fragmentation to me (we had a customer
with a huge number of AGs with little space in them, reporting such errors
after upgrading to 16 but not for earlier versions, so somehow
posix_fallocate() had to be the culprit).
Given the workload expires old partitions, I'm not sure we conclude a whole
lot from the current state :/
2.
# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0 nrext64=0
Yay, reflink=0, that's pretty old fs ?!
I think that only started to default to on more recently (2019, plus time to
percolate into RHEL). The more curious cases is finobt=0 (turned on by default
since 2015) and to a lesser degree sparse=0 (turned on by default since 2018).
ERROR: could not extend file
"pg_tblspc/16401/PG_16_202307071/17643/1249.1" with FileFallocate(): No
space left on device
2. This indicates it was allocating 1GB for such a table (".1"), on a
tablespace that was created more than a year ago. Could you get us maybe
those below commands too? (or from any other directory exhibiting such
errors)
The date in the directory is the catversion of the server, which is just
determined by the major version being used, not the creation time of the
tablespace.
andres@awork3:~/src/postgresql$ git grep CATALOG_VERSION_NO upstream/REL_16_STABLE src/include/catalog/catversion.h
upstream/REL_16_STABLE:src/include/catalog/catversion.h:#define CATALOG_VERSION_NO 202307071
Greetings,
Andres Freund
On 2024-12-10 11:34:15 -0500, Andres Freund wrote:
On 2024-12-10 12:36:33 +0100, Jakub Wartak wrote:
On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic@gmail.com> wrote:
2.
# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0 nrext64=0
Yay, reflink=0, that's pretty old fs ?!
I think that only started to default to on more recently (2019, plus time to
percolate into RHEL). The more curious case is finobt=0 (turned on by default
since 2015) and to a lesser degree sparse=0 (turned on by default since 2018).
One thing that might be interesting is to compare xfs_info of affected and
non-affected servers...
On Mon, Dec 9, 2024 at 7:31 PM Andres Freund <andres@anarazel.de> wrote:
Pretty unexcited about all of these - XFS is fairly widely used for PG, but
this problem doesn't seem very common. It seems to me that we're missing
something that causes this to only happen in a small subset of cases.
I wonder if this is actually pretty common on XFS. I mean, we've
already hit this with at least one EDB customer, and Michael's report
is, as far as I know, independent of that; and he points to a
pgsql-general thread which, AFAIK, is also independent. We don't get
three (or more?) independent reports of that many bugs, so I think
it's not crazy to think that the problem is actually pretty common.
It's probably workload dependent somehow, but for all we know today it
seems like the workload could be as simple as "do enough file
extension and you'll get into trouble eventually" or maybe "do enough
file extension[with some level of concurrency and you'll get into
trouble eventually".
I think the source of this needs to be debugged further before we try to apply
workarounds in postgres.
Why? It seems to me that this has to be a filesystem bug, and we
should almost certainly adopt one of these ideas from Michael Harris:
- Providing a way to configure PG not to use posix_fallocate at runtime
- In the case of posix_fallocate failing with ENOSPC, fall back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)
Maybe we need some more research to figure out which of those two
things we should do -- I suspect the second one is better but if that
fails then we might need to do the first one -- but I doubt that we
can wait for XFS to fix whatever the issue is here. Our usage of
posix_fallocate doesn't look to be anything more than plain vanilla,
so as between these competing hypotheses:
(1) posix_fallocate is and always has been buggy and you can't rely on it, or
(2) we use posix_fallocate in a way that nobody else has and have hit
an incredibly obscure bug as a result, which will be swiftly patched
...the first seems much more likely.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2024-12-10 12:36:40 -0500, Robert Haas wrote:
On Mon, Dec 9, 2024 at 7:31 PM Andres Freund <andres@anarazel.de> wrote:
Pretty unexcited about all of these - XFS is fairly widely used for PG, but
this problem doesn't seem very common. It seems to me that we're missing
something that causes this to only happen in a small subset of cases.
I wonder if this is actually pretty common on XFS. I mean, we've
already hit this with at least one EDB customer, and Michael's report
is, as far as I know, independent of that; and he points to a
pgsql-general thread which, AFAIK, is also independent. We don't get
three (or more?) independent reports of that many bugs, so I think
it's not crazy to think that the problem is actually pretty common.
Maybe. I think we would have gotten a lot more reports if it were common. I
know of quite a few very busy installs using xfs.
I think there must be some as-of-yet-unknown condition gating it. E.g. that
the filesystem has been created a while ago and has some now-on-by-default
options disabled.
I think the source of this needs to be debugged further before we try to apply
workarounds in postgres.
Why? It seems to me that this has to be a filesystem bug,
Adding workarounds for half-understood problems tends to lead to code that we
can't evolve in the future, as we a) don't understand and b) can't reproduce
the problem.
Workarounds could also mask some bigger / worse issues. We e.g. have blamed
ext4 for a bunch of bugs that then turned out to be ours in the past. But we
didn't look for a long time, because it was convenient to just blame ext4.
and we should almost certainly adopt one of these ideas from Michael Harris:
- Providing a way to configure PG not to use posix_fallocate at runtime
I'm not strongly opposed to that. That's testable without access to an
affected system. I wouldn't want to automatically do that when detecting an
affected system though, that'll make behaviour way less predictable.
- In the case of posix_fallocate failing with ENOSPC, fall back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)
I doubt that that's a good idea. What if fallocate failing is an indicator of
a problem? What if you turn on AIO + DIO and suddenly get a much more
fragmented file?
Greetings,
Andres Freund