FileFallocate misbehaving on XFS

Started by Michael Harris, about 1 year ago, 54 messages
#1Michael Harris
harmic@gmail.com

Hello PG Hackers

Our application has recently migrated to PG16, and we have experienced
some failed upgrades. The upgrades are performed using pg_upgrade and
have failed during the phase where the schema is restored into the new
cluster, with the following error:

pg_restore: error: could not execute query: ERROR: could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device
HINT: Check free disk space.

This has happened multiple times on different servers, and in each
case there was plenty of free space available.

We found this thread describing similar issues:

/messages/by-id/AS1PR05MB91059AC8B525910A5FCD6E699F9A2@AS1PR05MB9105.eurprd05.prod.outlook.com

As is the case in that thread, all of the affected databases are using XFS.

One of my colleagues built postgres from source with
HAVE_POSIX_FALLOCATE not defined, and using that build he was able to
complete the pg_upgrade, and then switched to a stock postgres build
after the upgrade. However, as you might expect, after the upgrade we
have experienced similar errors during regular operation. We make
heavy use of COPY, which is mentioned in the above discussion as
pre-allocating files.

We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
Linux 9 (Kernel 5.14.0).

I am wondering if this bug might be related:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323

When given an offset of 0 and a length, fallocate (man 2 fallocate) reports ENOSPC if the size of the file + the length to be allocated is greater than the available space.

There is a reproduction procedure at the bottom of the above ubuntu
thread, and using that procedure I get the same results on both kernel
4.18.0 and 5.14.0.
When calling fallocate with offset zero on an existing file, I get
enospc even if I am only requesting the same amount of space as the
file already has.
If I repeat the experiment with ext4 I don't get that behaviour.
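
To make the experiment concrete, the test boils down to something like the
following minimal C sketch (illustrative only, not the reproduction script
from the Ubuntu thread; the file path is just an example):

/*
 * Re-fallocate the range a file already occupies and report the result.
 * On the affected XFS systems this reportedly returns ENOSPC when free
 * space is smaller than the file, even though no new blocks should be
 * needed; on ext4 it returns 0.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";
    int         fd = open(path, O_RDWR);
    struct stat st;
    int         rc;

    if (fd < 0 || fstat(fd, &st) < 0)
    {
        perror(path);
        return 1;
    }

    /* offset 0, length = the file's existing size */
    rc = posix_fallocate(fd, 0, st.st_size);
    if (rc != 0)    /* posix_fallocate returns the error, not -1/errno */
        fprintf(stderr, "posix_fallocate(0, %lld): %s\n",
                (long long) st.st_size, strerror(rc));
    else
        printf("posix_fallocate(0, %lld): ok\n", (long long) st.st_size);

    close(fd);
    return rc != 0;
}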

On a surface examination of the code paths leading to the
FileFallocate call, it does not look like it should be trying to
allocate already allocated space, but I might have missed something
there.

Is this already being looked into?

Thanks in advance,

Cheers
Mike

#2Andrea Gelmini
andrea.gelmini@gmail.com
In reply to: Michael Harris (#1)
Re: FileFallocate misbehaving on XFS

On Mon, 9 Dec 2024 at 10:19, Michael Harris <harmic@gmail.com> wrote:

Is this already being looked into?

Funny, I guess it's the same reason I randomly see the WhatsApp web
interface complain, on Chrome, since I switched to XFS. It says something
like "no more space on disk" and logs out, with more than 300GB available.

Anyway, just a stupid hint: I would try writing to the XFS mailing list. There
you can reach the XFS maintainers at Red Hat and the usual historical
developers, of course!!!

#3Tomas Vondra
tomas@vondra.me
In reply to: Michael Harris (#1)
Re: FileFallocate misbehaving on XFS

On 12/9/24 08:34, Michael Harris wrote:

Hello PG Hackers

Our application has recently migrated to PG16, and we have experienced
some failed upgrades. The upgrades are performed using pg_upgrade and
have failed during the phase where the schema is restored into the new
cluster, with the following error:

pg_restore: error: could not execute query: ERROR: could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device
HINT: Check free disk space.

This has happened multiple times on different servers, and in each
case there was plenty of free space available.

We found this thread describing similar issues:

/messages/by-id/AS1PR05MB91059AC8B525910A5FCD6E699F9A2@AS1PR05MB9105.eurprd05.prod.outlook.com

As is the case in that thread, all of the affected databases are using XFS.

One of my colleagues built postgres from source with
HAVE_POSIX_FALLOCATE not defined, and using that build he was able to
complete the pg_upgrade, and then switched to a stock postgres build
after the upgrade. However, as you might expect, after the upgrade we
have experienced similar errors during regular operation. We make
heavy use of COPY, which is mentioned in the above discussion as
pre-allocating files.

We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
Linux 9 (Kernel 5.14.0).

I am wondering if this bug might be related:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323

When given an offset of 0 and a length, fallocate (man 2 fallocate) reports ENOSPC if the size of the file + the length to be allocated is greater than the available space.

There is a reproduction procedure at the bottom of the above ubuntu
thread, and using that procedure I get the same results on both kernel
4.18.0 and 5.14.0.
When calling fallocate with offset zero on an existing file, I get
enospc even if I am only requesting the same amount of space as the
file already has.
If I repeat the experiment with ext4 I don't get that behaviour.

On a surface examination of the code paths leading to the
FileFallocate call, it does not look like it should be trying to
allocate already allocated space, but I might have missed something
there.

Is this already being looked into?

Sounds more like an XFS bug/behavior, so it's not clear to me what we
could do about it. I mean, if the filesystem reports bogus out-of-space,
is there even something we can do?

What is not clear to me is why would this affect pg_upgrade at all. We
have the data files split into 1GB segments, and the copy/clone/... goes
one by one. So there shouldn't be more than 1GB "extra" space needed.
Surely you have more free space on the system?

regards

--
Tomas Vondra

#4Tomas Vondra
tomas@vondra.me
In reply to: Andrea Gelmini (#2)
Re: FileFallocate misbehaving on XFS

On 12/9/24 10:47, Andrea Gelmini wrote:

On Mon, 9 Dec 2024 at 10:19, Michael Harris <harmic@gmail.com> wrote:

Is this already being looked into?

Funny, I guess it's the same reason I randomly see the WhatsApp web
interface complain, on Chrome, since I switched to XFS. It says something
like "no more space on disk" and logs out, with more than 300GB available.

If I understand the fallocate issue correctly, it essentially ignores
the offset, so "fallocate -o 0 -l LENGTH" fails if

LENGTH + CURRENT_LENGTH > FREE_SPACE

But if you have 300GB available, that'd mean you have a file that's
close to that size already. But is that likely for WhatsApp?

Anyway, just a stupid hint: I would try writing to the XFS mailing list.
There you can reach the XFS maintainers at Red Hat and the usual historical
developers, of course!!!

Yes, I think that's a better place to report this. I don't think we're
doing anything particularly weird / wrong with fallocate().

regards

--
Tomas Vondra

#5Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Michael Harris (#1)
Re: FileFallocate misbehaving on XFS

On Mon, Dec 9, 2024 at 10:19 AM Michael Harris <harmic@gmail.com> wrote:

Hi Michael,

We found this thread describing similar issues:

/messages/by-id/AS1PR05MB91059AC8B525910A5FCD6E699F9A2@AS1PR05MB9105.eurprd05.prod.outlook.com

We've had some cases in the past here at EDB where an OS vendor blamed
XFS AG fragmentation (too many AGs, and if one AG does not have enough
space -> error). Could you perhaps show us the output of the following on that LUN:
1. xfs_info
2. the script from https://www.suse.com/support/kb/doc/?id=000018219
for your AG range

-J.

#6Tomas Vondra
tomas@vondra.me
In reply to: Jakub Wartak (#5)
Re: FileFallocate misbehaving on XFS

On 12/9/24 11:27, Jakub Wartak wrote:

On Mon, Dec 9, 2024 at 10:19 AM Michael Harris <harmic@gmail.com> wrote:

Hi Michael,

We found this thread describing similar issues:

/messages/by-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com

We've had some cases in the past here at EDB where an OS vendor
blamed XFS AG fragmentation (too many AGs, and if one AG does not have
enough space -> error). Could you perhaps show us the output of the
following on that LUN:
1. xfs_info
2. the script from https://www.suse.com/support/kb/doc/?id=000018219 for
your AG range

But this can be reproduced on a brand new filesystem - I just tried
creating a 1GB image, creating XFS on it, mounting it, and fallocating a
600MB file twice. The second fallocate fails, and there can't be any real
fragmentation.

regards

--
Tomas Vondra

#7Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#6)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-09 15:47:55 +0100, Tomas Vondra wrote:

On 12/9/24 11:27, Jakub Wartak wrote:

On Mon, Dec 9, 2024 at 10:19 AM Michael Harris <harmic@gmail.com> wrote:

Hi Michael,

We found this thread describing similar issues:

/messages/by-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com

We've had some cases in the past here at EDB where an OS vendor
blamed XFS AG fragmentation (too many AGs, and if one AG does not have
enough space -> error). Could you perhaps show us the output of the
following on that LUN:
1. xfs_info
2. the script from https://www.suse.com/support/kb/doc/?id=000018219 for
your AG range

But this can be reproduced on a brand new filesystem - I just tried
creating a 1GB image, creating XFS on it, mounting it, and fallocating a
600MB file twice. The second fallocate fails, and there can't be any real
fragmentation.

If I understand correctly xfs, before even looking at the file's current
layout, checks if there's enough free space for the fallocate() to
succeed. Here's an explanation for why:
https://www.spinics.net/lists/linux-xfs/msg55429.html

The real problem with preallocation failing part way through due to
overcommit of space is that we can't go back an undo the
allocation(s) made by fallocate because when we get ENOSPC we have
lost all the state of the previous allocations made. If fallocate is
filling holes between unwritten extents already in the file, then we
have no way of knowing where the holes we filled were and hence
cannot reliably free the space we've allocated before ENOSPC was
hit.

I.e. reserving space as you go would leave you open to ending up with some,
but not all, of those allocations having been made. Whereas pre-reserving the
worst case space needed, ahead of time, ensures that you have enough space to
go through it all.

You can't just go through the file [range] and compute how much free space you
will need to allocate and then do a second pass through the file, because the
file layout might have changed concurrently...

This issue seems independent of the issue Michael is having though. Postgres,
afaik, won't fallocate huge ranges with already allocated space.
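
To illustrate that: the extension path only ever asks for the range being
appended past the current end of the file, along the lines of this
self-contained sketch (the pattern only, not PostgreSQL's actual code;
BLCKSZ and the helper name are just for illustration):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192                 /* PostgreSQL's default block size */

/* Allocate space only for blocks appended past the current end of file. */
static int
extend_by_blocks(int fd, int nblocks)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return errno;

    /* offset = current EOF, length = just the newly added blocks */
    return posix_fallocate(fd, st.st_size, (off_t) BLCKSZ * nblocks);
}

int
main(int argc, char **argv)
{
    int     fd = open(argc > 1 ? argv[1] : "testfile", O_RDWR | O_CREAT, 0600);
    int     rc;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    rc = extend_by_blocks(fd, 64);  /* e.g. extend by 64 blocks = 512kB */
    if (rc != 0)
        fprintf(stderr, "extend failed: %s\n", strerror(rc));
    close(fd);
    return rc != 0;
}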

Greetings,

Andres Freund

#8Andres Freund
andres@anarazel.de
In reply to: Michael Harris (#1)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-09 18:34:22 +1100, Michael Harris wrote:

Our application has recently migrated to PG16, and we have experienced
some failed upgrades. The upgrades are performed using pg_upgrade and
have failed during the phase where the schema is restored into the new
cluster, with the following error:

pg_restore: error: could not execute query: ERROR: could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device
HINT: Check free disk space.

Were those pg_upgrades done with pg_upgrade --clone? Or have been, on the same
filesystem, in the past?

The reflink stuff in xfs (which is used to implement copy-on-write for files)
is somewhat newer and you're using somewhat old kernels:

We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
Linux 9 (Kernel 5.14.0).

I found some references for bugs that were fixed in 5.13. But I think at least
some of this would persist if the filesystem ran into the issue with a kernel
before those fixes. Did you upgrade "in-place" from Rocky Linux 8?

I am wondering if this bug might be related:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323

Doubt it, we never do this as far as I am aware.

Greetings,

Andres Freund

#9Michael Harris
harmic@gmail.com
In reply to: Andres Freund (#8)
Re: FileFallocate misbehaving on XFS

Hi Andres

On Tue, 10 Dec 2024 at 03:31, Andres Freund <andres@anarazel.de> wrote:

Were those pg_upgrades done with pg_upgrade --clone? Or have been, on the same
filesystem, in the past?

No, our procedure is to use --link.

I found some references for bugs that were fixed in 5.13. But I think at least
some of this would persist if the filesystem ran into the issue with a kernel
before those fixes. Did you upgrade "in-place" from Rocky Linux 8?

We generally don't use "in place" OS upgrades - however we would
usually have the databases on separate filesystem(s) to the OS, and
those filesystem(s) would be preserved through the upgrade, while the
root fs would be scratched.
A lot of the cases reported are on RL8. I will try to find out the
history of the RL9 cases to see if the filesystems started on RL8.

Could you please provide me links for the kernel bugs you are referring to?

Cheers
Mike.

#10Michael Harris
harmic@gmail.com
In reply to: Tomas Vondra (#3)
Re: FileFallocate misbehaving on XFS

Hi Tomas

On Mon, 9 Dec 2024 at 21:06, Tomas Vondra <tomas@vondra.me> wrote:

Sounds more like an XFS bug/behavior, so it's not clear to me what we
could do about it. I mean, if the filesystem reports bogus out-of-space,
is there even something we can do?

I don't disagree that it's most likely an XFS issue. However, XFS is
pretty widely used - it's the default FS for RHEL & the default in
SUSE for non-root partitions - so maybe some action should be taken.

Some things we could consider:

- Providing a way to configure PG not to use posix_fallocate at runtime

- Detecting the use of XFS (probably nasty and complex to do in a
platform independent way) and disable posix_fallocate

- In the case of posix_fallocate failing with ENOSPC, fall back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space; see the sketch after this list)

- Documenting that XFS might not be a good choice, at least for some
kernel versions
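
To illustrate the fallback option above, here is a rough, self-contained
sketch of the idea in plain C (illustrative only, not PostgreSQL code; in
the server it would presumably sit around the FileFallocate() call in
mdzeroextend()):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to reserve space with posix_fallocate(); if the filesystem claims
 * ENOSPC, fall back to writing zeroes, which only fails once space is
 * genuinely exhausted.
 */
int
extend_with_fallback(int fd, off_t offset, off_t len)
{
    int     rc = posix_fallocate(fd, offset, len);

    if (rc == ENOSPC)
    {
        static const char zeros[8192];      /* zero-initialized buffer */
        off_t   written = 0;

        while (written < len)
        {
            off_t   chunk = len - written;
            ssize_t n;

            if (chunk > (off_t) sizeof(zeros))
                chunk = (off_t) sizeof(zeros);

            n = pwrite(fd, zeros, (size_t) chunk, offset + written);
            if (n < 0)
                return errno;               /* a genuine ENOSPC (or other error) */
            written += n;
        }
        return 0;
    }
    return rc;
}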

What is not clear to me is why would this affect pg_upgrade at all. We
have the data files split into 1GB segments, and the copy/clone/... goes
one by one. So there shouldn't be more than 1GB "extra" space needed.
Surely you have more free space on the system?

Yes, that also confused me. It actually fails during the schema
restore phase - where pg_upgrade calls pg_restore to restore a
schema-only dump that it takes earlier in the process. At this stage
it is only trying to restore the schema, not any actual table data.
Note that we use the --link option to pg_upgrade, so it should not be
using much space even when the table data is being upgraded.

The filesystems have >1TB free space when this has occurred.

It does continue to give this error after the upgrade, at apparently
random intervals, when data is being loaded into the DB using COPY
commands, so it might be best not to focus too much on the fact that
we first encounter it during the upgrade.

Cheers
Mike.

#11Andres Freund
andres@anarazel.de
In reply to: Michael Harris (#9)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-10 09:34:08 +1100, Michael Harris wrote:

On Tue, 10 Dec 2024 at 03:31, Andres Freund <andres@anarazel.de> wrote:

I found some references for bugs that were fixed in 5.13. But I think at least
some of this would persist if the filesystem ran into the issue with a kernel
before those fixes. Did you upgrade "in-place" from Rocky Linux 8?

We generally don't use "in place" OS upgrades - however we would
usually have the databases on separate filesystem(s) to the OS, and
those filesystem(s) would be preserved through the upgrade, while the
root fs would be scratched.

Makes sense.

A lot of the cases reported are on RL8. I will try to find out the
history of the RL9 cases to see if the filesystems started on RL8.

That'd be helpful....

Could you please provide me links for the kernel bugs you are referring to?

I unfortunately closed most of the tabs, the only one I could quickly find
again is the one referenced at the bottom of:
https://www.spinics.net/lists/linux-xfs/msg55445.html

Greetings,

Andres

#12Andres Freund
andres@anarazel.de
In reply to: Michael Harris (#10)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-10 10:00:43 +1100, Michael Harris wrote:

On Mon, 9 Dec 2024 at 21:06, Tomas Vondra <tomas@vondra.me> wrote:

Sounds more like an XFS bug/behavior, so it's not clear to me what we
could do about it. I mean, if the filesystem reports bogus out-of-space,
is there even something we can do?

I don't disagree that it's most likely an XFS issue. However, XFS is
pretty widely used - it's the default FS for RHEL & the default in
SUSE for non-root partitions - so maybe some action should be taken.

Some things we could consider:

- Providing a way to configure PG not to use posix_fallocate at runtime

- Detecting the use of XFS (probably nasty and complex to do in a
platform independent way) and disable posix_fallocate

- In the case of posix_fallocate failing with ENOSPC, fall back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)

- Documenting that XFS might not be a good choice, at least for some
kernel versions

Pretty unexcited about all of these - XFS is fairly widely used for PG, but
this problem doesn't seem very common. It seems to me that we're missing
something that causes this to only happen in a small subset of cases.

I think the source of this needs to be debugged further before we try to apply
workarounds in postgres.

Are you using any filesystem quotas?

It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
xfs_spaceman -c 'freesp -s' /mountpoint
xfs_spaceman -c 'health' /mountpoint
and df.

What kind of storage is this on?

Was the filesystem ever grown from a smaller size?

Have you checked the filesystem's internal consistency? I.e. something like
xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
unmounted though. But corrupted filesystem datastructures certainly could
cause spurious ENOSPC.

What is not clear to me is why would this affect pg_upgrade at all. We
have the data files split into 1GB segments, and the copy/clone/... goes
one by one. So there shouldn't be more than 1GB "extra" space needed.
Surely you have more free space on the system?

Yes, that also confused me. It actually fails during the schema
restore phase - where pg_upgrade calls pg_restore to restore a
schema-only dump that it takes earlier in the process. At this stage
it is only trying to restore the schema, not any actual table data.
Note that we use the --link option to pg_upgrade, so it should not be
using much space even when the table data is being upgraded.

Are you using pg_upgrade -j?

I'm asking because looking at linux's git tree I found this interesting recent
commit: https://git.kernel.org/linus/94a0333b9212 - but IIUC it'd actually
cause file creation, not fallocate, to fail.

The filesystems have >1TB free space when this has occurred.

It does continue to give this error after the upgrade, at apparently
random intervals, when data is being loaded into the DB using COPY
commands, so it might be best not to focus too much on the fact that
we first encounter it during the upgrade.

I assume the file that actually errors out changes over time? It's always
fallocate() that fails?

Can you tell us anything about the workload / data? Lots of tiny tables, lots
of big tables, write heavy, ...?

Greetings,

Andres Freund

#13Michael Harris
harmic@gmail.com
In reply to: Andres Freund (#12)
Re: FileFallocate misbehaving on XFS

Hi Andres

Following up on the earlier question about OS upgrade paths - all the
cases reported so far are either on RL8 (Kernel 4.18.0) or were
upgraded to RL9 (kernel 5.14.0) and the affected filesystems were
preserved.
In fact the RL9 systems were initially built as Centos 7, and then
when that went EOL they were upgraded to RL9. The process was as I
described - the /var/opt filesystem which contained the database was
preserved, and the root and other OS filesystems were scratched.
The majority of systems where we have this problem are on RL8.

On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:

Are you using any filesystem quotas?

No.

It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
xfs_spaceman -c 'freesp -s' /mountpoint
xfs_spaceman -c 'health' /mountpoint
and df.

I gathered this info from one of the systems that is currently on RL9.
This system is relatively small compared to some of the others that
have exhibited this issue, but it is the only one I can access right
now.

# uname -a
Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0 nrext64=0
data = bsize=4096 blocks=1049885696, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=512639, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

# for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
from to extents blocks pct
1 1 37502 37502 0.15
2 3 62647 148377 0.59
4 7 87793 465950 1.85
8 15 135529 1527172 6.08
16 31 184811 3937459 15.67
32 63 165979 7330339 29.16
64 127 101674 8705691 34.64
128 255 15123 2674030 10.64
256 511 973 307655 1.22
total free extents 792031
total free blocks 25134175
average free extent size 31.7338
from to extents blocks pct
1 1 43895 43895 0.22
2 3 59312 141693 0.70
4 7 83406 443964 2.20
8 15 120804 1362108 6.75
16 31 133140 2824317 14.00
32 63 118619 5188474 25.71
64 127 77960 6751764 33.46
128 255 16383 2876626 14.26
256 511 1763 546506 2.71
total free extents 655282
total free blocks 20179347
average free extent size 30.7949
from to extents blocks pct
1 1 72034 72034 0.26
2 3 98158 232135 0.83
4 7 126228 666187 2.38
8 15 169602 1893007 6.77
16 31 180286 3818527 13.65
32 63 164529 7276833 26.01
64 127 109687 9505160 33.97
128 255 22113 3921162 14.02
256 511 1901 592052 2.12
total free extents 944538
total free blocks 27977097
average free extent size 29.6199
from to extents blocks pct
1 1 51462 51462 0.21
2 3 98993 233204 0.93
4 7 131578 697655 2.79
8 15 178151 1993062 7.97
16 31 175718 3680535 14.72
32 63 145310 6372468 25.48
64 127 89518 7749021 30.99
128 255 18926 3415768 13.66
256 511 2640 813586 3.25
total free extents 892296
total free blocks 25006761
average free extent size 28.0252

# xfs_spaceman -c 'health' /var/opt
Health status has not been collected for this filesystem.

What kind of storage is this on?

As mentioned, there are quite a few systems in different sites, so a
number of different storage solutions in use, some with directly
attached disks, others with some SAN solutions.
The instance I got the printout above from is a VM, but in the other
site they are all bare metal.

Was the filesystem ever grown from a smaller size?

I can't say for sure that none of them were, but given the number of
different systems that have this issue I am confident that would not
be a common factor.

Have you checked the filesystem's internal consistency? I.e. something like
xfs_repair -n /dev/nvme2n1. It does require the filesystem to be read-only or
unmounted though. But corrupted filesystem datastructures certainly could
cause spurious ENOSPC.

I executed this on the same system as the above prints came from. It
did not report any issues.

Are you using pg_upgrade -j?

Yes, we use -j `nproc`

I assume the file that actually errors out changes over time? It's always
fallocate() that fails?

Yes, correct, on both counts.

Can you tell us anything about the workload / data? Lots of tiny tables, lots
of big tables, write heavy, ...?

It is a write heavy application which stores mostly time series data.

The time series data is partitioned by time; the application writes
constantly into the 'current' partition, and data is expired by
removing the oldest partition. Most of the data is written once and
not updated.

There are quite a lot of these partitioned tables (in the 1000's or
10000's) depending on how the application is configured. Individual
partitions range in size from a few MB to 10s of GB.

Cheers
Mike.

#14Michael Harris
harmic@gmail.com
In reply to: Michael Harris (#13)
Re: FileFallocate misbehaving on XFS

Hi again

One extra piece of information: I had said that all the machines were
Rocky Linux 8 or Rocky Linux 9, but actually a large number of them
are RHEL8.

Sorry for the confusion.

Of course RL8 is a rebuild of RHEL8 so it is not surprising they would
be behaving similarly.

Cheers
Mike


#15Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Michael Harris (#14)
1 attachment(s)
Re: FileFallocate misbehaving on XFS

On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic@gmail.com> wrote:

Hi Michael,

1. Well, it doesn't look like XFS AG fragmentation to me (we had a customer
with a huge number of AGs with little space in them) who reported such errors
after upgrading to 16, but not for earlier versions (so somehow
posix_fallocate() had to be the culprit).

2.

# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0 nrext64=0

Yay, reflink=0, that's a pretty old fs?!

ERROR: could not extend file "pg_tblspc/16401/PG_16_202307071/17643/1249.1"
with FileFallocate(): No space left on device

2. This indicates it was allocating 1GB for such a table (".1"), on a
tablespace that was created more than a year ago. Could you maybe get us
the output of the commands below too? (or from any other directory
exhibiting such errors)

stat pg_tblspc/16401/PG_16_202307071/17643/
ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l
time ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l # to assess timing of the getdents() call, as that may tell us something about that directory indirectly

3. Maybe somehow there is a bigger interaction between posix_fallocate()
and XFS's delayed dynamic speculative preallocation from many processes all
writing into different partitions? Maybe try the "allocsize=1m" mount option
for that filesystem and see if that helps. I'm going to speculate about XFS
speculative :) preallocations, but if we have the fd cache and are *not*
closing fds, how would XFS know to abort its own speculation about a
streaming write? (multiply that by potentially the number of opened fds
to get an avalanche of "preallocations").

4. You can also try compiling with the patch from Alvaro from [2]
("0001-Add-some-debugging-around-mdzeroextend.patch"), so we might end up
having more clarity on the offsets involved. If not, then you could use 'strace
-e fallocate -p <pid>' to get the exact syscall.

5. Another idea could be catching the kernel-side stack trace of fallocate()
when it is hitting ENOSPC. E.g. with an XFS fs and the attached bpftrace eBPF
tracer I could get to the source of the problem in my artificial reproducer,
e.g.

# bpftrace ./track_enospc2.bt # wait for "START" and then start reproducing
in the second session, but try to minimize the time period, as eBPF might
make things really slow

$ dd if=/dev/zero of=/fs/test1 bs=1M count=200
$ fallocate /fs/test -l 30000000
fallocate: fallocate failed: No space left on device
$ df -h /fs
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 236M 217M 20M 92% /fs

# in bpftrace CTRL+C, will get:
@errors[-28, kretprobe:xfs_file_fallocate,
xfs_alloc_file_space+665
xfs_alloc_file_space+665
xfs_file_fallocate+869
vfs_fallocate+319
__x64_sys_fallocate+68
do_syscall_64+130
entry_SYSCALL_64_after_hwframe+118
]: 1

-28 = ENOSPC; xfs_alloc_file_space() was the routine that was the root cause,
and the stack shows the full logic behind it. That ABI might be different on
your side due to kernel variations. The tracer could be enhanced, and it might
print too much (so you need to look for that -28 in the output). If you get
any sensible output from it, you could also involve OS support (because if
posix_fallocate() fails while there is space left, then it's pretty odd anyway).

-J.

[1]: /messages/by-id/50A117B6.5030300@optionshouse.com
[2]: /messages/by-id/202409110955.6njbwzm4ocus@alvherre.pgsql

Attachments:

track_enospc2.bt (application/octet-stream)
#16Andres Freund
andres@anarazel.de
In reply to: Michael Harris (#13)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-10 17:28:21 +1100, Michael Harris wrote:

On Tue, 10 Dec 2024 at 11:31, Andres Freund <andres@anarazel.de> wrote:

It'd be useful to get the xfs_info output that Jakub asked for. Perhaps also
xfs_spaceman -c 'freesp -s' /mountpoint
xfs_spaceman -c 'health' /mountpoint
and df.

I gathered this info from one of the systems that is currently on RL9.
This system is relatively small compared to some of the others that
have exhibited this issue, but it is the only one I can access right
now.

I think it's implied, but I just want to be sure: This was one of the affected
systems?

# uname -a
Linux 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15
12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0 nrext64=0
data = bsize=4096 blocks=1049885696, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=512639, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

It might be interesting that finobt=0, sparse=0 and nrext64=0. Those all
affect space allocation to some degree, and more recently created filesystems
will have them set to different values, which could explain why you, but not
that many others, hit this issue.

Any chance to get df output? I'm mainly curious about the number of used
inodes.

Could you show the mount options that end up being used?
grep /var/opt /proc/mounts

I rather doubt it is, but it'd sure be interesting if inode32 were used.

I assume you have never set XFS options for the PG directory or files within
it? Could you show
xfs_io -r -c lsattr -c stat -c statfs /path/to/directory/with/enospc
?

# for agno in `seq 0 3`; do xfs_spaceman -c "freesp -s -a $agno" /var/opt; done
from to extents blocks pct
1 1 37502 37502 0.15
2 3 62647 148377 0.59
4 7 87793 465950 1.85
8 15 135529 1527172 6.08
16 31 184811 3937459 15.67
32 63 165979 7330339 29.16
64 127 101674 8705691 34.64
128 255 15123 2674030 10.64
256 511 973 307655 1.22
total free extents 792031
total free blocks 25134175
average free extent size 31.7338
from to extents blocks pct
1 1 43895 43895 0.22
2 3 59312 141693 0.70
4 7 83406 443964 2.20
8 15 120804 1362108 6.75
16 31 133140 2824317 14.00
32 63 118619 5188474 25.71
64 127 77960 6751764 33.46
128 255 16383 2876626 14.26
256 511 1763 546506 2.71
total free extents 655282
total free blocks 20179347
average free extent size 30.7949
from to extents blocks pct
1 1 72034 72034 0.26
2 3 98158 232135 0.83
4 7 126228 666187 2.38
8 15 169602 1893007 6.77
16 31 180286 3818527 13.65
32 63 164529 7276833 26.01
64 127 109687 9505160 33.97
128 255 22113 3921162 14.02
256 511 1901 592052 2.12
total free extents 944538
total free blocks 27977097
average free extent size 29.6199
from to extents blocks pct
1 1 51462 51462 0.21
2 3 98993 233204 0.93
4 7 131578 697655 2.79
8 15 178151 1993062 7.97
16 31 175718 3680535 14.72
32 63 145310 6372468 25.48
64 127 89518 7749021 30.99
128 255 18926 3415768 13.66
256 511 2640 813586 3.25
total free extents 892296
total free blocks 25006761
average free extent size 28.0252

So there's *some*, but not a lot, of imbalance in AG usage. Of course that's
as of this moment, and as you say below, you expire old partitions on a
regular basis...

My understanding of XFS's space allocation is that by default it continues to
use the same AG for allocations within one directory, until that AG is full.
For a write heavy postgres workload that's of course not optimal, as all
activity will focus on one AG.

I'd try monitoring the per-AG free space over time and see if the ENOSPC
issue is correlated with one AG getting full. 'freesp' is probably too
expensive for that, but it looks like
xfs_db -r -c agresv /dev/nvme6n1
should work?

Actually that output might be interesting to see, even when you don't hit the
issue.

Can you tell us anything about the workload / data? Lots of tiny tables, lots
of big tables, write heavy, ...?

It is a write heavy application which stores mostly time series data.

The time series data is partitioned by time; the application writes
constantly into the 'current' partition, and data is expired by
removing the oldest partition. Most of the data is written once and
not updated.

There are quite a lot of these partitioned tables (in the 1000's or
10000's) depending on how the application is configured. Individual
partitions range in size from a few MB to 10s of GB.

So there are 1000s of tables that are concurrently being appended to, but
only in one partition each. That does make it plausible that there's a
significant amount of fragmentation. Possibly transient due to the expiration.

How many partitions are there for each of the tables? Mainly wondering because
of the number of inodes being used.

Are all of the active tables within one database? That could be relevant due
to per-directory behaviour of free space allocation.

Greetings,

Andres Freund

#17Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#15)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-10 12:36:33 +0100, Jakub Wartak wrote:

On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic@gmail.com> wrote:
1. Well, it doesn't look like XFS AG fragmentation to me (we had a customer
with a huge number of AGs with little space in them) who reported such errors
after upgrading to 16, but not for earlier versions (so somehow
posix_fallocate() had to be the culprit).

Given the workload expires old partitions, I'm not sure we can conclude a
whole lot from the current state :/

2.

# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0 nrext64=0

Yay, reflink=0, that's a pretty old fs?!

I think that only started defaulting to on more recently (2019, plus time to
percolate into RHEL). The more curious cases are finobt=0 (turned on by default
since 2015) and, to a lesser degree, sparse=0 (turned on by default since 2018).

ERROR: could not extend file "pg_tblspc/16401/PG_16_202307071/17643/1249.1"
with FileFallocate(): No space left on device

2. This indicates it was allocating 1GB for such a table (".1"), on a
tablespace that was created more than a year ago. Could you maybe get us
the output of the commands below too? (or from any other directory
exhibiting such errors)

The date in the directory is the catversion of the server, which is just
determined by the major version being used, not the creation time of the
tablespace.

andres@awork3:~/src/postgresql$ git grep CATALOG_VERSION_NO upstream/REL_16_STABLE src/include/catalog/catversion.h
upstream/REL_16_STABLE:src/include/catalog/catversion.h:#define CATALOG_VERSION_NO 202307071

Greetings,

Andres Freund

#18Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
Re: FileFallocate misbehaving on XFS

On 2024-12-10 11:34:15 -0500, Andres Freund wrote:

On 2024-12-10 12:36:33 +0100, Jakub Wartak wrote:

On Tue, Dec 10, 2024 at 7:34 AM Michael Harris <harmic@gmail.com> wrote:
2.

# xfs_info /dev/mapper/ippvg-ipplv
meta-data=/dev/mapper/ippvg-ipplv isize=512 agcount=4, agsize=262471424 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0 nrext64=0

Yay, reflink=0, that's a pretty old fs?!

I think that only started defaulting to on more recently (2019, plus time to
percolate into RHEL). The more curious cases are finobt=0 (turned on by default
since 2015) and, to a lesser degree, sparse=0 (turned on by default since 2018).

One thing that might be interesting is to compare xfs_info of affected and
non-affected servers...

#19Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#12)
Re: FileFallocate misbehaving on XFS

On Mon, Dec 9, 2024 at 7:31 PM Andres Freund <andres@anarazel.de> wrote:

Pretty unexcited about all of these - XFS is fairly widely used for PG, but
this problem doesn't seem very common. It seems to me that we're missing
something that causes this to only happen in a small subset of cases.

I wonder if this is actually pretty common on XFS. I mean, we've
already hit this with at least one EDB customer, and Michael's report
is, as far as I know, independent of that; and he points to a
pgsql-general thread which, AFAIK, is also independent. We don't get
three (or more?) independent reports of that many bugs, so I think
it's not crazy to think that the problem is actually pretty common.
It's probably workload dependent somehow, but for all we know today it
seems like the workload could be as simple as "do enough file
extension and you'll get into trouble eventually" or maybe "do enough
file extension[with some level of concurrency and you'll get into
trouble eventually".

I think the source of this needs to be debugged further before we try to apply
workarounds in postgres.

Why? It seems to me that this has to be a filesystem bug, and we
should almost certainly adopt one of these ideas from Michael Harris:

- Providing a way to configure PG not to use posix_fallocate at runtime

- In the case of posix_fallocate failing with ENOSPC, fall back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)

Maybe we need some more research to figure out which of those two
things we should do -- I suspect the second one is better but if that
fails then we might need to do the first one -- but I doubt that we
can wait for XFS to fix whatever the issue is here. Our usage of
posix_fallocate doesn't look to be anything more than plain vanilla,
so as between these competing hypotheses:

(1) posix_fallocate is and always has been buggy and you can't rely on it, or
(2) we use posix_fallocate in a way that nobody else has and have hit
an incredibly obscure bug as a result, which will be swiftly patched

...the first seems much more likely.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#19)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-10 12:36:40 -0500, Robert Haas wrote:

On Mon, Dec 9, 2024 at 7:31 PM Andres Freund <andres@anarazel.de> wrote:

Pretty unexcited about all of these - XFS is fairly widely used for PG, but
this problem doesn't seem very common. It seems to me that we're missing
something that causes this to only happen in a small subset of cases.

I wonder if this is actually pretty common on XFS. I mean, we've
already hit this with at least one EDB customer, and Michael's report
is, as far as I know, independent of that; and he points to a
pgsql-general thread which, AFAIK, is also independent. We don't get
three (or more?) independent reports of that many bugs, so I think
it's not crazy to think that the problem is actually pretty common.

Maybe. I think we would have gotten a lot more reports if it were common. I
know of quite a few very busy installs using xfs.

I think there must be some as-of-yet-unknown condition gating it. E.g. that
the filesystem has been created a while ago and has some now-on-by-default
options disabled.

I think the source of this needs to be debugged further before we try to apply
workarounds in postgres.

Why? It seems to me that this has to be a filesystem bug,

Adding workarounds for half-understood problems tends to lead to code that we
can't evolve in the future, as we a) don't understand b) can't reproduce the
problem.

Workarounds could also mask some bigger / worse issues. We e.g. have blamed
ext4 for a bunch of bugs that then turned out to be ours in the past. But we
didn't look for a long time, because it was convenient to just blame ext4.

and we should almost certainly adopt one of these ideas from Michael Harris:

- Providing a way to configure PG not to use posix_fallocate at runtime

I'm not strongly opposed to that. That's testable without access to an
affected system. I wouldn't want to automatically do that when detecting an
affected system though, that'll make behaviour way less predictable.

- In the case of posix_fallocate failing with ENOSPC, fall back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)

I doubt that that's a good idea. What if fallocate failing is an indicator of
a problem? What if you turn on AIO + DIO and suddenly get a much more
fragmented file?

Greetings,

Andres Freund

#21Michael Harris
harmic@gmail.com
In reply to: Andres Freund (#16)
Re: FileFallocate misbehaving on XFS

Hi Andres

On Wed, 11 Dec 2024 at 03:09, Andres Freund <andres@anarazel.de> wrote:

I think it's implied, but I just want to be sure: This was one of the affected
systems?

Yes, correct.

Any chance to get df output? I'm mainly curious about the number of used
inodes.

Sorry, I could swear I had included that already! Here it is:

# df /var/opt
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/ippvg-ipplv 4197492228 3803866716 393625512 91% /var/opt

# df -i /var/opt
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/ippvg-ipplv 419954240 1568137 418386103 1% /var/opt

Could you show the mount options that end up being used?
grep /var/opt /proc/mounts

/dev/mapper/ippvg-ipplv /var/opt xfs
rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

These seem to be the defaults.

I assume you have never set XFS options for the PG directory or files within
it?

Correct.

Could you show
xfs_io -r -c lsattr -c stat -c statfs /path/to/directory/with/enospc

-p--------------X pg_tblspc/16402/PG_16_202307071/49163/1132925906.4
fd.path = "pg_tblspc/16402/PG_16_202307071/49163/1132925906.4"
fd.flags = non-sync,non-direct,read-only
stat.ino = 4320612794
stat.type = regular file
stat.size = 201211904
stat.blocks = 393000
fsxattr.xflags = 0x80000002 [-p--------------X]
fsxattr.projid = 0
fsxattr.extsize = 0
fsxattr.cowextsize = 0
fsxattr.nextents = 165
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
fd.path = "pg_tblspc/16402/PG_16_202307071/49163/1132925906.4"
statfs.f_bsize = 4096
statfs.f_blocks = 1049373057
statfs.f_bavail = 98406378
statfs.f_files = 419954240
statfs.f_ffree = 418386103
statfs.f_flags = 0x1020
geom.bsize = 4096
geom.agcount = 4
geom.agblocks = 262471424
geom.datablocks = 1049885696
geom.rtblocks = 0
geom.rtextents = 0
geom.rtextsize = 1
geom.sunit = 0
geom.swidth = 0
counts.freedata = 98406378
counts.freertx = 0
counts.freeino = 864183
counts.allocino = 2432320

I'd try monitoring the per-ag free space over time and see if the the ENOSPC
issue is correlated with one AG getting full. 'freesp' is probably too
expensive for that, but it looks like
xfs_db -r -c agresv /dev/nvme6n1
should work?

Actually that output might be interesting to see, even when you don't hit the
issue.

I will see if I can set that up.

How many partitions are there for each of the tables? Mainly wondering because
of the number of inodes being used.

It is configurable and varies from site to site. It could range from 7
up to maybe 60.

Are all of the active tables within one database? That could be relevant due
to per-directory behaviour of free space allocation.

Each pg instance may have one or more application databases. Typically
data is being written into all of them (although sometimes a database
will be archived, with no new data going into it).

You might be onto something though. The system I got the above prints
from is only experiencing this issue in one directory - that might not
mean very much though, it only has 2 databases and one of them looks
like it is not receiving imports.
But another system I can access has multiple databases with ongoing
imports, yet all the errors bar one relate to one directory.
I will collect some data from that system and post it shortly.

Cheers
Mike

#22Michael Harris
harmic@gmail.com
In reply to: Jakub Wartak (#15)
Re: FileFallocate misbehaving on XFS

Hi Jakub

On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Yay, reflink=0, that's pretty old fs ?!

This particular filesystem was created on Centos 7, and retained when
the system was upgraded to RL9. So yes probably pretty old!

Could you maybe get us the output of the commands below too? (or from any other directory exhibiting such errors)

stat pg_tblspc/16401/PG_16_202307071/17643/
ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l
time ls -1 pg_tblspc/16401/PG_16_202307071/17643/ | wc -l # to assess timing of the getdents() call, as that may tell us something about that directory indirectly

# stat pg_tblspc/16402/PG_16_202307071/49163/
File: pg_tblspc/16402/PG_16_202307071/49163/
Size: 5177344 Blocks: 14880 IO Block: 4096 directory
Device: fd02h/64770d Inode: 4299946593 Links: 2
Access: (0700/drwx------) Uid: ( 26/postgres) Gid: ( 26/postgres)
Access: 2024-12-11 09:39:42.467802419 +0900
Modify: 2024-12-11 09:51:19.813948673 +0900
Change: 2024-12-11 09:51:19.813948673 +0900
Birth: 2024-11-25 17:37:11.812374672 +0900

# time ls -1 pg_tblspc/16402/PG_16_202307071/49163/ | wc -l
179000

real 0m0.474s
user 0m0.439s
sys 0m0.038s

3. Maybe somehow there is a bigger interaction between posix_fallocate() and XFS's delayed dynamic speculative preallocation from many processes all writing into different partitions? Maybe try the "allocsize=1m" mount option for that filesystem and see if that helps. I'm going to speculate about XFS speculative :) preallocations, but if we have the fd cache and are *not* closing fds, how would XFS know to abort its own speculation about a streaming write? (multiply that by potentially the number of opened fds to get an avalanche of "preallocations").

I will try to organize that. They are production systems so it might
take some time.

4. You can also try compiling with patch from Alvaro from [2] "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up having more clarity in offsets involved. If not then you could use 'strace -e fallocate -p <pid>' to get the exact syscall.

I'll take a look at Alvaro's patch. strace sounds good, but how to
arrange to start it on the correct PG backends? There will be a
large-ish number of PG backends going at a time, only some of which
are performing imports, and they will be coming and going every so
often as the ETL application scales up and down with the load.

5. Another idea could be catching the kernel side stacktrace of fallocate() when it is hitting ENOSPC. E.g. with XFS fs and attached bpftrace eBPF tracer I could get the source of the problem in my artificial reproducer, e.g

OK, I will look into that also.

Cheers
Mike

#23Michael Harris
harmic@gmail.com
In reply to: Michael Harris (#21)
1 attachment(s)
Re: FileFallocate misbehaving on XFS

Hi again

On Wed, 11 Dec 2024 at 12:09, Michael Harris <harmic@gmail.com> wrote:

But another system I can access has multiple databases with ongoing
imports, yet all the errors bar one relate to one directory.
I will collect some data from that system and post it shortly.

I've attached the same set of data collected from an RHEL8 system.

Unfortunately the 'agresv' subcommand does not exist in the version of
xfs_db that is available on RHEL8, so I was not able to implement that
suggestion.

I thought I had one *L8 system that had an XFS filesystem and had not
experienced this issue, but it turns out it had - just at a much lower
frequency.

Cheers
Mike

Attachments:

rhel8_fallocate_fail.log (application/octet-stream)
#24Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Michael Harris (#22)
Re: FileFallocate misbehaving on XFS

On Wed, Dec 11, 2024 at 4:00 AM Michael Harris <harmic@gmail.com> wrote:

Hi Jakub

On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

[..]

3. Maybe somehow there is a bigger interaction between posix_fallocate()
and XFS's delayed dynamic speculative preallocation from many processes all
writing into different partitions? Maybe try the "allocsize=1m" mount option
for that filesystem and see if that helps. I'm going to speculate about XFS
speculative :) preallocations, but if we have the fd cache and are *not*
closing fds, how would XFS know to abort its own speculation about a
streaming write? (multiply that by potentially the number of opened fds
to get an avalanche of "preallocations").

I will try to organize that. They are production systems so it might
take some time.

Cool.

4. You can also try compiling with patch from Alvaro from [2]
"0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up
having more clarity in offsets involved. If not then you could use 'strace
-e fallocate -p <pid>' to get the exact syscall.

I'll take a look at Alvaro's patch. strace sounds good, but how to
arrange to start it on the correct PG backends? There will be a
large-ish number of PG backends going at a time, only some of which
are performing imports, and they will be coming and going every so
often as the ETL application scales up and down with the load.

Yes, it sounds like mission impossible. Is there any chance you can get the
reproduction down to one or a small number of postgres backends doing the
writes?

5. Another idea could be catching the kernel-side stack trace of
fallocate() when it is hitting ENOSPC. E.g. with an XFS fs and the attached
bpftrace eBPF tracer I could get to the source of the problem in my artificial
reproducer, e.g.

OK, I will look into that also.

Hopefully that reveals some more. Somehow UNIX error reporting lumps one big
pile of errors into the single ENOSPC category, and that's not helpful at all
(inode/extent/block allocation problems are all squeezed into one error).

Anyway, in case it helps others, here are my notes so far on this thread,
including that useful file from the subthread; hopefully I did not
misinterpret something:

- works in <PG16, but fails with >= PG16 due to posix_fallocate() being used
rather than multiple separate (but adjacent) iovectors passed to pg_writev();
it kicks in only in the case of mdzeroextend() with numblocks > 8
- 179k or 414k files in a single directory (0.3s - 0.5s just to list those)
- OS/FS upgraded from an earlier release
- one AG with extremely low extent sizes compared to the other AGs (I bet
the 2->3 bucket at 22.73% below means small 8192b pg files in $PGDATA, but
there are no large extents in that AG)
from to extents blocks pct
1 1 4949 4949 0.65
2 3 86113 173452 22.73
4 7 19399 94558 12.39
8 15 23233 248602 32.58
16 31 12425 241421 31.64
total free extents 146119
total free blocks 762982
average free extent size 5.22165 (!)
- note that the max free extent size above (31 blocks) is very low compared
to the other AGs, which have 1024-8192; therefore it looks like there are no
contiguous runs of blocks for request sizes above 31*4096 = 126976 bytes
within that AG (??)
- we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped
at a maximum of 64 pg blocks (which is higher than the above; see the sketch
after this list)
- but the failures were also observed using pg_upgrade --link -j /
pg_restore -j (also concurrent posix_fallocate() into many independent files
sharing the same AG, but that's 1 backend : 1 file, so no contention for
waitcount in RelationAddBlocks())
- so maybe it's lots of backends doing independent concurrent
posix_fallocate() calls that end up somehow coalesced? Or, hypothetically,
say 16-32 fallocate() calls hit the same AG initially; maybe it's some form
of concurrency race condition inside XFS where one of the fallocate calls
fails to find space in that one AG, although according to [1] it should fall
back to other AGs.
- and there's also additional XFS dynamic speculative preallocation that
might cause space pressure during our normal writes..
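
For reference, a minimal sketch of the extension-sizing logic mentioned in
the bullet above -- this only illustrates the quoted formula under a
hypothetical helper name, it is not the actual bufmgr.c code:

static int
sketch_extend_by_pages(int extend_by_pages, int waitcount)
{
    /* grow the request with the number of waiting backends, capped at 64 */
    extend_by_pages += extend_by_pages * waitcount;
    if (extend_by_pages > 64)
        extend_by_pages = 64;
    return extend_by_pages;
}

So with e.g. 7 waiters, a single 8-block request already hits the 64-block cap.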

Another workaround idea/test: create tablespace on the same XFS fs (but in
a somewhat different directory if possible) and see if it still fails.

-J.

[1]: https://blogs.oracle.com/linux/post/extent-allocation-in-xfs

#25Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#20)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-10 16:33:06 -0500, Andres Freund wrote:

Maybe. I think we would have gotten a lot more reports if it were common. I
know of quite a few very busy installs using xfs.

I think there must be some as-of-yet-unknown condition gating it. E.g. that
the filesystem has been created a while ago and has some now-on-by-default
options disabled.

I think the source of this needs to be debugged further before we try to apply
workarounds in postgres.

Why? It seems to me that this has to be a filesystem bug,

Adding workarounds for half-understood problems tends to lead to code that we
can't evolve in the future, as we a) don't understand b) can't reproduce the
problem.

Workarounds could also mask some bigger / worse issues. We e.g. have blamed
ext4 for a bunch of bugs that then turned out to be ours in the past. But we
didn't look for a long time, because it was convenient to just blame ext4.

and we should almost certainly adopt one of these ideas from Michael Harris:

- Providing a way to configure PG not to use posix_fallocate at runtime

I'm not strongly opposed to that. That's testable without access to an
affected system. I wouldn't want to automatically do that when detecting an
affected system though, that'll make behaviour way less predictable.

- In the case of posix_fallocate failing with ENOSPC, fall back to
FileZero (worst case that will fail as well, in which case we will
know that we really are out of space)

I doubt that that's a good idea. What if fallocate failing is an indicator of
a problem? What if you turn on AIO + DIO and suddenly get a much more
fragmented file?

One thing that I think we should definitely do is to include more detail in
the error message. mdzeroextend()'s error messages don't include how many
blocks the relation was to be extended by. Neither mdextend() nor
mdzeroextend() include the offset at which the extension failed.

I'm not entirely sure about the phrasing though, we have a somewhat confusing
mix of blocks and bytes in messages.

Perhaps some of information should be in an errdetail, but I admit I'm a bit
hesitant about doing so for crucial details. I find that often only the
primary error message is available when debugging problems encountered by
others.

Maybe something like
/* translator: second %s is a function name like FileAllocate() */
could not extend file \"%s\" by %u blocks, from %llu to %llu bytes, using %s: %m
or
could not extend file \"%s\" using %s by %u blocks, from its current size of %u blocks: %m
or
could not extend file \"%s\" using %s by %u blocks/%llu bytes from its current size of %llu bytes: %m

If we want to use errdetail() judiciously, we could go for something like
errmsg("could not extend file \"%s\" by %u blocks, using %s: %m", ...
errdetail("Failed to extend file from %u blocks/%llu bytes to %u blocks / %llu bytes.", ...)

I think it might also be good - this is a slightly more complicated project -
to report the amount of free space the filesystem reports when we hit
ENOSPC. I have dealt with cases of the FS transiently filling up way too many
times, and it's always a pain to figure that out.

Greetings,

Andres Freund

#26Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#24)
Re: FileFallocate misbehaving on XFS

Hi,

FWIW, I tried fairly hard to reproduce this.

I ran an extended cycle of 80 backends copying into relations and
occasionally truncating them (to simulate partitions being dropped and new
ones created). For this I kept a 4TB filesystem very close to full (peaking
at 99.998% full).

I did not see any ENOSPC errors unless the filesystem really was full at that
time. To check that, I made mdzeroextend() do a statfs() when encountering
ENOSPC, printed statfs.f_blocks and made that case PANIC.

What I do see is that after - intentionally - hitting an out-of-disk-space
error, the available disk space would occasionally increase a small amount
after a few seconds. Regardless of whether using the fallocate and
non-fallocate path.

From what I can tell this small increase in free space has a few reasons:

- Checkpointer might not have gotten around to unlinking files, keeping the
inode alive.

- Occasionally bgwriter or a backend would have already-unlinked relation
segments open, so the inode (though not the actual file space, because the
segment is truncated to prevent that) could not yet be removed from the
filesystem

- It looks like xfs does some small amount of work to reclaim space in the
background. Which makes sense, otherwise each unlink would have to be a
flush to disk.

But that's nowhere near enough space to explain what you're seeing. The most
I saw was 6MB, when ramping up the truncation frequency a lot.

Of course this was on a newer kernel, not on RHEL / RL 8/9.

Just to make sure - you're absolutely certain that you actually have space at
the time of the errors? E.g. a checkpoint completing soon after an ENOSPC
could free up a lot of space by removing now-unneeded WAL files. That can be
100s of gigabytes.

If I were to provide you with a patch that showed the amount of free disk
space at the time of an error, the size of the relation etc, could you
reproduce the issue with it applied? Or is that unrealistic?

On 2024-12-11 13:05:21 +0100, Jakub Wartak wrote:

- one AG with extreme low extent sizes compared to the others AGs (I bet
that 2->3 22.73% below means small 8192b pg files in $PGDATA, but there are
no large extents in that AG)
from to extents blocks pct
1 1 4949 4949 0.65
2 3 86113 173452 22.73
4 7 19399 94558 12.39
8 15 23233 248602 32.58
16 31 12425 241421 31.64
total free extents 146119
total free blocks 762982
average free extent size 5.22165 (!)

Note that this does not mean that all extents in the AG are that small, just
that the *free* extents are of that size.

I think this might primarily be because this AG has the smallest amount of
free blocks (2.9GB). However, the fact that it *does* have less could be
interesting. It might be the AG associated with the directory of the busiest
database or some such.

The next least-space AG is:

from to extents blocks pct
1 1 1021 1021 0.10
2 3 48748 98255 10.06
4 7 9840 47038 4.81
8 15 13648 146779 15.02
16 31 15818 323022 33.06
32 63 584 27932 2.86
64 127 147 14286 1.46
128 255 253 49047 5.02
256 511 229 87173 8.92
512 1023 139 102456 10.49
1024 2047 51 72506 7.42
2048 4095 3 7422 0.76
total free extents 90481
total free blocks 976937

It seems plausible it would look similar if more of the free blocks were
used.

- we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped
up to 64 pg blocks maximum (and that's higher than the above)
- but the failures were also observed using pg_upgrade --link -j/pg_restore
-j (also concurrent posix_fallocate() to many independent files sharing the
same AG, but that's 1 backend:1 file so no contention for waitcount in
RelationAddBlocks())

We also extend by more than one page, even without concurrency, if
bulk-insertion is used, and I think we do use that for e.g. pg_attribute --
which is actually the table where pg_restore encountered the issue:

pg_restore: error: could not execute query: ERROR: could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device

1249 is the initial relfilenode for pg_attribute.

There could also be some parallelism leading to bulk extension, due to the
parallel restore. I don't remember which commands pg_restore actually executes
in parallel.

Greetings,

Andres Freund

#27Michael Harris
harmic@gmail.com
In reply to: Andres Freund (#26)
Re: FileFallocate misbehaving on XFS

Hi Andres

On Thu, 12 Dec 2024 at 10:50, Andres Freund <andres@anarazel.de> wrote:

Just to make sure - you're absolutely certain that you actually have space at
the time of the errors?

As sure as I can be. The RHEL8 system that I took prints from
yesterday has > 1.5TB free. I can't see it varying by that much.

It does look as though the system needs to be quite full to provoke
this problem. The systems I have looked at so far have >90% full
filesystems.

Another interesting snippet: the application has a number of ETL
workers going at once. The actual number varies depending on a number
of factors but might be somewhere from 10 - 150. Each worker will have
a single postgres backend that they are feeding data to.

At the time of the error, it is not the case that all ETL workers
strike it at once - it looks like a lot of the time only a single
worker is affected, or at most a handful of workers. I can't see for
sure what the other workers were doing at the time, but I would expect
they were all importing data as well.

If I were to provide you with a patch that showed the amount of free disk
space at the time of an error, the size of the relation etc, could you
reproduce the issue with it applied? Or is that unrealistic?

I have not been able to reproduce it on demand, and so far it has only
happened in production systems.

As long as the patch doesn't degrade normal performance it should be
possible to deploy it to one of the systems that is regularly
reporting the error, although it might take a while to get approval to
do that.

Cheers
Mike

#28Andres Freund
andres@anarazel.de
In reply to: Michael Harris (#27)
3 attachment(s)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-12 14:14:20 +1100, Michael Harris wrote:

On Thu, 12 Dec 2024 at 10:50, Andres Freund <andres@anarazel.de> wrote:

Just to make sure - you're absolutely certain that you actually have space at
the time of the errors?

As sure as I can be. The RHEL8 system that I took prints from
yesterday has > 1.5TB free. I can't see it varying by that much.

That does seem unlikely, but it'd probably still be worth monitoring by how
much it varies.

It does look as though the system needs to be quite full to provoke
this problem. The systems I have looked at so far have >90% full
filesystems.

Another interesting snippet: the application has a number of ETL
workers going at once. The actual number varies depending on a number
of factors but might be somewhere from 10 - 150. Each worker will have
a single postgres backend that they are feeding data to.

Are they all inserting into distinct tables/partitions or into shared tables?

At the time of the error, it is not the case that all ETL workers
strike it at once - it looks like a lot of the time only a single
worker is affected, or at most a handful of workers. I can't see for
sure what the other workers were doing at the time, but I would expect
they were all importing data as well.

When you say that they're not "all striking it at once", do you mean that some
of them aren't interacting with the database at the time, or that they're not
erroring out?

If I were to provide you with a patch that showed the amount of free disk
space at the time of an error, the size of the relation etc, could you
reproduce the issue with it applied? Or is that unrealistic?

I have not been able to reproduce it on demand, and so far it has only
happened in production systems.

As long as the patch doesn't degrade normal performance it should be
possible to deploy it to one of the systems that is regularly
reporting the error, although it might take a while to get approval to
do that.

Cool. The patch only has an effect in the branches reporting out-of-space
errors, so there's no overhead during normal operation. And the additional
detail doesn't have much overhead in the error case either.

I attached separate patches for 16, 17 and master, as there are some minor
conflicts between the versions.

Greetings,

Andres Freund

Attachments:

16-0001-md-Report-more-detail-when-encountering-ENOSPC-durin.patch
From c8ecdff54fcdbd2cf89ca7888f641db369f207ce Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 12 Dec 2024 12:57:12 -0500
Subject: [PATCH] md: Report more detail when encountering ENOSPC during
 extension

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 meson.build                   |  1 +
 configure.ac                  |  1 +
 src/include/pg_config.h.in    |  3 ++
 src/backend/storage/smgr/md.c | 63 +++++++++++++++++++++++++++++++----
 configure                     |  2 +-
 5 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/meson.build b/meson.build
index 4e59feb91da..e644db41ef9 100644
--- a/meson.build
+++ b/meson.build
@@ -2269,6 +2269,7 @@ header_checks = [
   'sys/procctl.h',
   'sys/signalfd.h',
   'sys/ucred.h',
+  'sys/vfs.h',
   'termios.h',
   'ucred.h',
 ]
diff --git a/configure.ac b/configure.ac
index 23add80d8fd..0984949a3b9 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1512,6 +1512,7 @@ AC_CHECK_HEADERS(m4_normalize([
 	sys/procctl.h
 	sys/signalfd.h
 	sys/ucred.h
+	sys/vfs.h
 	termios.h
 	ucred.h
 ]))
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index ce3063b2b22..626de538821 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -478,6 +478,9 @@
 /* Define to 1 if you have the <sys/ucred.h> header file. */
 #undef HAVE_SYS_UCRED_H
 
+/* Define to 1 if you have the <sys/vfs.h> header file. */
+#undef HAVE_SYS_VFS_H
+
 /* Define to 1 if you have the <termios.h> header file. */
 #undef HAVE_TERMIOS_H
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad1709..67c42a69c11 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -24,6 +24,9 @@
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
+#ifdef HAVE_SYS_VFS_H
+#include <sys/vfs.h>
+#endif
 
 #include "access/xlog.h"
 #include "access/xlogutils.h"
@@ -449,6 +452,37 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 	pfree(path);
 }
 
+static void
+report_disk_space(const char *reason, const char *path)
+{
+	/*
+	 * I'm sure there's a way to do this on other OSs too, but for the
+	 * debugging here this should be sufficient.
+	 */
+#ifdef HAVE_SYS_VFS_H
+	int			saved_errno = errno;
+	struct statfs sf;
+	int			ret;
+
+	ret = statfs(path, &sf);
+
+	if (ret != 0)
+		elog(WARNING, "%s: statfs failed: %m", reason);
+	else
+		elog(LOG, "%s: free space for filesystem containing \"%s\" "
+			 "f_blocks: %llu, f_bfree: %llu, f_bavail: %llu "
+			 "f_files: %llu, f_ffree: %llu",
+			 reason, path,
+			 (long long unsigned) sf.f_blocks,
+			 (long long unsigned) sf.f_bfree,
+			 (long long unsigned) sf.f_bavail,
+			 (long long unsigned) sf.f_files,
+			 (long long unsigned) sf.f_ffree);
+
+	errno = saved_errno;
+#endif
+}
+
 /*
  * mdextend() -- Add a block to the specified relation.
  *
@@ -496,11 +530,16 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
+		if (errno == ENOSPC)
+			report_disk_space("mdextend failing with ENOSPC",
+							  FilePathName(v->mdfd_vfd));
+
 		if (nbytes < 0)
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not extend file \"%s\": %m",
-							FilePathName(v->mdfd_vfd)),
+					 errmsg("could not extend file \"%s\" from %u to %u blocks: %m",
+							FilePathName(v->mdfd_vfd),
+							blocknum, blocknum + 1),
 					 errhint("Check free disk space.")));
 		/* short write: complain appropriately */
 		ereport(ERROR,
@@ -586,10 +625,15 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 								WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret != 0)
 			{
+				if (errno == ENOSPC)
+					report_disk_space("mdzeroextend FileFallocate failing with ENOSPC",
+									  FilePathName(v->mdfd_vfd));
+
 				ereport(ERROR,
 						errcode_for_file_access(),
-						errmsg("could not extend file \"%s\" with FileFallocate(): %m",
-							   FilePathName(v->mdfd_vfd)),
+						errmsg("could not extend file \"%s\" by %u blocks, from %u to %u, using FileFallocate(): %m",
+							   FilePathName(v->mdfd_vfd),
+							   numblocks, segstartblock, segstartblock+numblocks),
 						errhint("Check free disk space."));
 			}
 		}
@@ -608,11 +652,18 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   seekpos, (off_t) BLCKSZ * numblocks,
 						   WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret < 0)
+			{
+				if (errno == ENOSPC)
+					report_disk_space("mdzeroextend FileZero failing with ENOSPC",
+									  FilePathName(v->mdfd_vfd));
+
 				ereport(ERROR,
 						errcode_for_file_access(),
-						errmsg("could not extend file \"%s\": %m",
-							   FilePathName(v->mdfd_vfd)),
+						errmsg("could not extend file \"%s\" by %u blocks, from %u to %u, using FileZero(): %m",
+							   FilePathName(v->mdfd_vfd),
+							   numblocks, segstartblock, segstartblock+numblocks),
 						errhint("Check free disk space."));
+			}
 		}
 
 		if (!skipFsync && !SmgrIsTemp(reln))
diff --git a/configure b/configure
index 8c2ab3a1973..f62f4f6d3ab 100755
--- a/configure
+++ b/configure
@@ -13768,7 +13768,7 @@ fi
 ## Header files
 ##
 
-for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h sys/epoll.h sys/event.h sys/personality.h sys/prctl.h sys/procctl.h sys/signalfd.h sys/ucred.h termios.h ucred.h
+for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h sys/epoll.h sys/event.h sys/personality.h sys/prctl.h sys/procctl.h sys/signalfd.h sys/ucred.h sys/vfs.h termios.h ucred.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
-- 
2.45.2.746.g06e570c0df.dirty

17-0001-md-Report-more-detail-when-encountering-ENOSPC-durin.patch
From e7119200b69e0c96f85cede510889be9548b0d73 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 12 Dec 2024 12:57:12 -0500
Subject: [PATCH] md: Report more detail when encountering ENOSPC during
 extension

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 meson.build                   |  1 +
 configure.ac                  |  1 +
 src/include/pg_config.h.in    |  3 ++
 src/backend/storage/smgr/md.c | 63 +++++++++++++++++++++++++++++++----
 configure                     |  2 +-
 5 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/meson.build b/meson.build
index 005dc9f3532..a0113e84aef 100644
--- a/meson.build
+++ b/meson.build
@@ -2415,6 +2415,7 @@ header_checks = [
   'sys/procctl.h',
   'sys/signalfd.h',
   'sys/ucred.h',
+  'sys/vfs.h',
   'termios.h',
   'ucred.h',
 ]
diff --git a/configure.ac b/configure.ac
index 3c76c9ebc87..074572aabf5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1478,6 +1478,7 @@ AC_CHECK_HEADERS(m4_normalize([
 	sys/procctl.h
 	sys/signalfd.h
 	sys/ucred.h
+	sys/vfs.h
 	termios.h
 	ucred.h
 ]))
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 2397d90b465..7b53c994699 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -462,6 +462,9 @@
 /* Define to 1 if you have the <sys/ucred.h> header file. */
 #undef HAVE_SYS_UCRED_H
 
+/* Define to 1 if you have the <sys/vfs.h> header file. */
+#undef HAVE_SYS_VFS_H
+
 /* Define to 1 if you have the <termios.h> header file. */
 #undef HAVE_TERMIOS_H
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6796756358f..8c49312db0d 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -24,6 +24,9 @@
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
+#ifdef HAVE_SYS_VFS_H
+#include <sys/vfs.h>
+#endif
 
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -447,6 +450,37 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 	pfree(path);
 }
 
+static void
+report_disk_space(const char *reason, const char *path)
+{
+	/*
+	 * I'm sure there's a way to do this on other OSs too, but for the
+	 * debugging here this should be sufficient.
+	 */
+#ifdef HAVE_SYS_VFS_H
+	int			saved_errno = errno;
+	struct statfs sf;
+	int			ret;
+
+	ret = statfs(path, &sf);
+
+	if (ret != 0)
+		elog(WARNING, "%s: statfs failed: %m", reason);
+	else
+		elog(LOG, "%s: free space for filesystem containing \"%s\" "
+			 "f_blocks: %llu, f_bfree: %llu, f_bavail: %llu "
+			 "f_files: %llu, f_ffree: %llu",
+			 reason, path,
+			 (long long unsigned) sf.f_blocks,
+			 (long long unsigned) sf.f_bfree,
+			 (long long unsigned) sf.f_bavail,
+			 (long long unsigned) sf.f_files,
+			 (long long unsigned) sf.f_ffree);
+
+	errno = saved_errno;
+#endif
+}
+
 /*
  * mdextend() -- Add a block to the specified relation.
  *
@@ -494,11 +528,16 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
+		if (errno == ENOSPC)
+			report_disk_space("mdextend failing with ENOSPC",
+							  FilePathName(v->mdfd_vfd));
+
 		if (nbytes < 0)
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not extend file \"%s\": %m",
-							FilePathName(v->mdfd_vfd)),
+					 errmsg("could not extend file \"%s\" from %u to %u blocks: %m",
+							FilePathName(v->mdfd_vfd),
+							blocknum, blocknum + 1),
 					 errhint("Check free disk space.")));
 		/* short write: complain appropriately */
 		ereport(ERROR,
@@ -584,10 +623,15 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 								WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret != 0)
 			{
+				if (errno == ENOSPC)
+					report_disk_space("mdzeroextend FileFallocate failing with ENOSPC",
+									  FilePathName(v->mdfd_vfd));
+
 				ereport(ERROR,
 						errcode_for_file_access(),
-						errmsg("could not extend file \"%s\" with FileFallocate(): %m",
-							   FilePathName(v->mdfd_vfd)),
+						errmsg("could not extend file \"%s\" by %u blocks, from %u to %u, using FileFallocate(): %m",
+							   FilePathName(v->mdfd_vfd),
+							   numblocks, segstartblock, segstartblock+numblocks),
 						errhint("Check free disk space."));
 			}
 		}
@@ -606,11 +650,18 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   seekpos, (off_t) BLCKSZ * numblocks,
 						   WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret < 0)
+			{
+				if (errno == ENOSPC)
+					report_disk_space("mdzeroextend FileZero failing with ENOSPC",
+									  FilePathName(v->mdfd_vfd));
+
 				ereport(ERROR,
 						errcode_for_file_access(),
-						errmsg("could not extend file \"%s\": %m",
-							   FilePathName(v->mdfd_vfd)),
+						errmsg("could not extend file \"%s\" by %u blocks, from %u to %u, using FileZero(): %m",
+							   FilePathName(v->mdfd_vfd),
+							   numblocks, segstartblock, segstartblock+numblocks),
 						errhint("Check free disk space."));
+			}
 		}
 
 		if (!skipFsync && !SmgrIsTemp(reln))
diff --git a/configure b/configure
index 97996b7f6b7..5efd85bb17a 100755
--- a/configure
+++ b/configure
@@ -13349,7 +13349,7 @@ fi
 ## Header files
 ##
 
-for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h sys/epoll.h sys/event.h sys/personality.h sys/prctl.h sys/procctl.h sys/signalfd.h sys/ucred.h termios.h ucred.h
+for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h sys/epoll.h sys/event.h sys/personality.h sys/prctl.h sys/procctl.h sys/signalfd.h sys/ucred.h sys/vfs.h termios.h ucred.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
-- 
2.45.2.746.g06e570c0df.dirty

HEAD-0001-md-Report-more-detail-when-encountering-ENOSPC-durin.patch
From 66a18a4565ec4cca4a8ce216aae9322a8c68731c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 12 Dec 2024 12:57:12 -0500
Subject: [PATCH] md: Report more detail when encountering ENOSPC during
 extension

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 meson.build                   |  1 +
 configure.ac                  |  1 +
 src/include/pg_config.h.in    |  3 ++
 src/backend/storage/smgr/md.c | 63 +++++++++++++++++++++++++++++++----
 configure                     |  2 +-
 5 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/meson.build b/meson.build
index e5ce437a5c7..05f622ccd79 100644
--- a/meson.build
+++ b/meson.build
@@ -2389,6 +2389,7 @@ header_checks = [
   'sys/procctl.h',
   'sys/signalfd.h',
   'sys/ucred.h',
+  'sys/vfs.h',
   'termios.h',
   'ucred.h',
   'xlocale.h',
diff --git a/configure.ac b/configure.ac
index 247ae97fa4c..d68774f9c89 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1446,6 +1446,7 @@ AC_CHECK_HEADERS(m4_normalize([
 	sys/procctl.h
 	sys/signalfd.h
 	sys/ucred.h
+	sys/vfs.h
 	termios.h
 	ucred.h
 	xlocale.h
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798abd..c5e083f8793 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -439,6 +439,9 @@
 /* Define to 1 if you have the <sys/ucred.h> header file. */
 #undef HAVE_SYS_UCRED_H
 
+/* Define to 1 if you have the <sys/vfs.h> header file. */
+#undef HAVE_SYS_VFS_H
+
 /* Define to 1 if you have the <termios.h> header file. */
 #undef HAVE_TERMIOS_H
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index cc8a80ee961..eac080f1a43 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -24,6 +24,9 @@
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
+#ifdef HAVE_SYS_VFS_H
+#include <sys/vfs.h>
+#endif
 
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
@@ -447,6 +450,37 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 	pfree(path);
 }
 
+static void
+report_disk_space(const char *reason, const char *path)
+{
+	/*
+	 * I'm sure there's a way to do this on other OSs too, but for the
+	 * debugging here this should be sufficient.
+	 */
+#ifdef HAVE_SYS_VFS_H
+	int			saved_errno = errno;
+	struct statfs sf;
+	int			ret;
+
+	ret = statfs(path, &sf);
+
+	if (ret != 0)
+		elog(WARNING, "%s: statfs failed: %m", reason);
+	else
+		elog(LOG, "%s: free space for filesystem containing \"%s\" "
+			 "f_blocks: %llu, f_bfree: %llu, f_bavail: %llu "
+			 "f_files: %llu, f_ffree: %llu",
+			 reason, path,
+			 (long long unsigned) sf.f_blocks,
+			 (long long unsigned) sf.f_bfree,
+			 (long long unsigned) sf.f_bavail,
+			 (long long unsigned) sf.f_files,
+			 (long long unsigned) sf.f_ffree);
+
+	errno = saved_errno;
+#endif
+}
+
 /*
  * mdextend() -- Add a block to the specified relation.
  *
@@ -494,11 +528,16 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
+		if (errno == ENOSPC)
+			report_disk_space("mdextend failing with ENOSPC",
+							  FilePathName(v->mdfd_vfd));
+
 		if (nbytes < 0)
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not extend file \"%s\": %m",
-							FilePathName(v->mdfd_vfd)),
+					 errmsg("could not extend file \"%s\" from %u to %u blocks: %m",
+							FilePathName(v->mdfd_vfd),
+							blocknum, blocknum + 1),
 					 errhint("Check free disk space.")));
 		/* short write: complain appropriately */
 		ereport(ERROR,
@@ -584,10 +623,15 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 								WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret != 0)
 			{
+				if (errno == ENOSPC)
+					report_disk_space("mdzeroextend FileFallocate failing with ENOSPC",
+									  FilePathName(v->mdfd_vfd));
+
 				ereport(ERROR,
 						errcode_for_file_access(),
-						errmsg("could not extend file \"%s\" with FileFallocate(): %m",
-							   FilePathName(v->mdfd_vfd)),
+						errmsg("could not extend file \"%s\" by %u blocks, from %u to %u, using FileFallocate(): %m",
+							   FilePathName(v->mdfd_vfd),
+							   numblocks, segstartblock, segstartblock+numblocks),
 						errhint("Check free disk space."));
 			}
 		}
@@ -606,11 +650,18 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   seekpos, (off_t) BLCKSZ * numblocks,
 						   WAIT_EVENT_DATA_FILE_EXTEND);
 			if (ret < 0)
+			{
+				if (errno == ENOSPC)
+					report_disk_space("mdzeroextend FileZero failing with ENOSPC",
+									  FilePathName(v->mdfd_vfd));
+
 				ereport(ERROR,
 						errcode_for_file_access(),
-						errmsg("could not extend file \"%s\": %m",
-							   FilePathName(v->mdfd_vfd)),
+						errmsg("could not extend file \"%s\" by %u blocks, from %u to %u, using FileZero(): %m",
+							   FilePathName(v->mdfd_vfd),
+							   numblocks, segstartblock, segstartblock+numblocks),
 						errhint("Check free disk space."));
+			}
 		}
 
 		if (!skipFsync && !SmgrIsTemp(reln))
diff --git a/configure b/configure
index 518c33b73a9..191cbca0844 100755
--- a/configure
+++ b/configure
@@ -13227,7 +13227,7 @@ fi
 ## Header files
 ##
 
-for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h mbarrier.h sys/epoll.h sys/event.h sys/personality.h sys/prctl.h sys/procctl.h sys/signalfd.h sys/ucred.h termios.h ucred.h xlocale.h
+for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h mbarrier.h sys/epoll.h sys/event.h sys/personality.h sys/prctl.h sys/procctl.h sys/signalfd.h sys/ucred.h sys/vfs.h termios.h ucred.h xlocale.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
-- 
2.45.2.746.g06e570c0df.dirty

#29Michael Harris
harmic@gmail.com
In reply to: Andres Freund (#28)
Re: FileFallocate misbehaving on XFS

Hi Andres

On Fri, 13 Dec 2024 at 08:38, Andres Freund <andres@anarazel.de> wrote:

Another interesting snippet: the application has a number of ETL
workers going at once. The actual number varies depending on a number
of factors but might be somewhere from 10 - 150. Each worker will have
a single postgres backend that they are feeding data to.

Are they all inserting into distinct tables/partitions or into shared tables?

The set of tables they are writing into is the same, but we do take some
effort to randomize the order of the tables that each worker is writing into
so as to reduce contention. Even so, it is quite likely that multiple
processes will be writing into a table at a time.
Also worth noting that I have only seen this error triggered by COPY
statements (other than the upgrade case). There are some other cases
in our code that use INSERT but so far I have not seen that end in an
out of space error.

When you say that they're not "all striking it at once", do you mean that some
of them aren't interacting with the database at the time, or that they're not
erroring out?

Sorry, I meant erroring out.

Thanks for the patch!

Cheers
Mike

#30Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Andres Freund (#25)
Re: FileFallocate misbehaving on XFS

On 2024-Dec-11, Andres Freund wrote:

One thing that I think we should definitely do is to include more detail in
the error message. mdzeroextend()'s error messages don't include how many
blocks the relation was to be extended by. Neither mdextend() nor
mdzeroextend() include the offset at which the extension failed.

I proposed a patch at
/messages/by-id/202409110955.6njbwzm4ocus@alvherre.pgsql

FileFallocate failure:
errmsg("could not allocate additional %lld bytes from position %lld in file \"%s\": %m",
(long long) addbytes, (long long) seekpos,
FilePathName(v->mdfd_vfd)),

FileZero failure:
errmsg("could not zero additional %lld bytes from position %lld file \"%s\": %m",
(long long) addbytes, (long long) seekpos,
FilePathName(v->mdfd_vfd)),

I'm not sure that we need to talk about blocks, given that the
underlying syscalls don't work in blocks anyway. IMO we should just
report bytes.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)

#31Thomas Munro
thomas.munro@gmail.com
In reply to: Alvaro Herrera (#30)
Re: FileFallocate misbehaving on XFS

On Sat, Dec 14, 2024 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2024-Dec-11, Andres Freund wrote:

One thing that I think we should definitely do is to include more detail in
the error message. mdzeroextend()'s error messages don't include how many
blocks the relation was to be extended by. Neither mdextend() nor
mdzeroextend() include the offset at which the extension failed.

I proposed a patch at
/messages/by-id/202409110955.6njbwzm4ocus@alvherre.pgsql

If adding more logging, I wonder why FileAccess()'s "re-open failed"
case is not considered newsworthy. I've suspected it as a candidate
source of an unexplained and possibly misattributed error in other
cases. I'm not saying it's at all likely in this case, but it seems
like just the sort of rare unexpected failure that we'd want to know
more about when trying to solve mysteries.

#32Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#31)
Re: FileFallocate misbehaving on XFS

On Sat, Dec 14, 2024 at 4:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Dec 14, 2024 at 9:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2024-Dec-11, Andres Freund wrote:

One thing that I think we should definitely do is to include more detail in
the error message. mdzeroextend()'s error messages don't include how many
blocks the relation was to be extended by. Neither mdextend() nor
mdzeroextend() include the offset at which the extension failed.

I proposed a patch at
/messages/by-id/202409110955.6njbwzm4ocus@alvherre.pgsql

If adding more logging, I wonder why FileAccess()'s "re-open failed"
case is not considered newsworthy. I've suspected it as a candidate
source of an unexplained and possibly misattributed error in other
cases. I'm not saying it's at all likely in this case, but it seems
like just the sort of rare unexpected failure that we'd want to know
more about when trying to solve mysteries.

Wow. That's truly abominable. It doesn't seem likely to explain this
case, because I don't think trying to reopen an existing file could
result in LruInsert() returning ENOSPC. But this code desperately
needs refactoring to properly report the open() failure as such,
instead of conflating it with a failure of whatever syscall we were
contemplating before we realized we needed to open().

--
Robert Haas
EDB: http://www.enterprisedb.com

#33Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#26)
Re: FileFallocate misbehaving on XFS

On Thu, Dec 12, 2024 at 12:50 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

FWIW, I tried fairly hard to reproduce this.

Same, but without PG and also without much success. I also tried to push the
AGs (with just one or two AGs created via mkfs) to contain only small
extents (by creating hundreds of thousands of 8kb files), then deleting some
of them (every Nth file), and then trying a couple of bigger
fallocates/writes to see if that would blow up on the original CentOS 7.9 /
3.10.x kernel, but no - it did not blow up. It only failed when df -h was
exactly 100%, in multiple scenarios like that (and yes, it sometimes added a
little space out of the blue too). So my take is that it's something related
to state (having an fd open) and concurrency.

An interesting thing I observed is that the per-directory AG affinity for
big directories (think $PGDATA) is lost when the AG is full, and extents are
then allocated from different AGs (one can use xfs_bmap -vv to see the AG
affinity of a directory vs. the files in it).

I ran an extended cycle of 80 backends copying into relations and
occasionally truncating them (to simulate partitions being dropped and new
ones created). For this I kept a 4TB filesystem very close to full (peaking
at 99.998% full).

The only question I could think of: how many files were involved there?
Maybe it is some kind of race between other (or the same) backends
frequently churning their fd caches with open()/close() [defeating
speculative preallocation], leaving XFS fragmented, and only then does
posix_fallocate() have issues with larger allocations (>> 8kB)? My take is
that if we send N I/O write vectors this seems to be handled fine, but when
we throw one big fallocate it is not -- so maybe posix_fallocate() was in
the process of finding space while some other activity happened to that
inode, like close() -- but then that doesn't seem to match the pg_upgrade
scenario.

Well IMHO we are stuck till Michael provides some more data (patch outcome,
bpf and maybe other hints and tests).

-J.

#34Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#30)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-14 09:29:12 +0100, Alvaro Herrera wrote:

On 2024-Dec-11, Andres Freund wrote:

One thing that I think we should definitely do is to include more detail in
the error message. mdzeroextend()'s error messages don't include how many
blocks the relation was to be extended by. Neither mdextend() nor
mdzeroextend() include the offset at which the extension failed.

I proposed a patch at
/messages/by-id/202409110955.6njbwzm4ocus@alvherre.pgsql

FileFallocate failure:
errmsg("could not allocate additional %lld bytes from position %lld in file \"%s\": %m",
(long long) addbytes, (long long) seekpos,
FilePathName(v->mdfd_vfd)),

FileZero failure:
errmsg("could not zero additional %lld bytes from position %lld file \"%s\": %m",
(long long) addbytes, (long long) seekpos,
FilePathName(v->mdfd_vfd)),

Personally I don't like the obfuscation of "allocate" and "zero" versus just
naming the functions. But I guess that's just a taste thing.

I'm not sure that we need to talk about blocks, given that the
underlying syscalls don't work in blocks anyway. IMO we should just
report bytes.

When looking for problems it's considerably more work with bytes, because -
at least for me - the large numbers are hard to compare quickly, and knowing
how aggressively we extended also requires translating to blocks.

Greetings,

Andres Freund

#35Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#34)
Re: FileFallocate misbehaving on XFS

On Mon, Dec 16, 2024 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:

Personally I don't like the obfuscation of "allocate" and "zero" vs just
naming the function names. But I guess that's just taste thing.

When looking for problems it's considerably more work with bytes, because - at
least for me - the large number is hard to compare quickly and to know how
aggressively we extended also requires to translate to blocks.

FWIW, I think that what we report in the error should hew as closely
to the actual system call as possible. Hence, I agree with your first
complaint and would prefer to simply see the system calls named, but I
disagree with your second complaint and would prefer to see the byte
count.

--
Robert Haas
EDB: http://www.enterprisedb.com

#36Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#33)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-16 14:45:37 +0100, Jakub Wartak wrote:

On Thu, Dec 12, 2024 at 12:50 AM Andres Freund <andres@anarazel.de> wrote:
An extended cycle of 80 backends copying into relations and occasionally

truncating them (to simulate the partitions being dropped and new ones
created). For this I ran a 4TB filesystem very close to fully filled
(peaking
at 99.998 % full).

I could only think of the question: how many files were involved there ?

I varied the number heavily. From dozens to 10s of thousands. No meaningful
difference.

Well IMHO we are stuck till Michael provides some more data (patch outcome,
bpf and maybe other hints and tests).

Yea.

Greetings,

Andres Freund

#37Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Robert Haas (#35)
Re: FileFallocate misbehaving on XFS

On 2024-Dec-16, Robert Haas wrote:

On Mon, Dec 16, 2024 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:

Personally I don't like the obfuscation of "allocate" and "zero" vs just
naming the function names. But I guess that's just taste thing.

When looking for problems it's considerably more work with bytes, because - at
least for me - the large number is hard to compare quickly and to know how
aggressively we extended also requires to translate to blocks.

FWIW, I think that what we report in the error should hew as closely
to the actual system call as possible. Hence, I agree with your first
complaint and would prefer to simply see the system calls named, but I
disagree with your second complaint and would prefer to see the byte
count.

Maybe we can add errdetail("The system call was FileFallocate( ... %u ...)")
with the number of bytes, and leave the errmsg() mentioning the general
operation being done (allocate, zero, etc) with the number of blocks.
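
To make that concrete, a rough sketch of how the two could fit together in
mdzeroextend() -- variable names borrowed from the patches upthread, and the
exact wording is only illustrative, not a proposal:

ereport(ERROR,
        errcode_for_file_access(),
        errmsg("could not allocate an additional %u blocks in file \"%s\": %m",
               numblocks, FilePathName(v->mdfd_vfd)),
        errdetail("The system call was FileFallocate(..., %lld, %lld).",
                  (long long) seekpos, (long long) ((off_t) BLCKSZ * numblocks)),
        errhint("Check free disk space."));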

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)

#38Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#37)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-16 18:05:59 +0100, Alvaro Herrera wrote:

On 2024-Dec-16, Robert Haas wrote:

On Mon, Dec 16, 2024 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:

Personally I don't like the obfuscation of "allocate" and "zero" vs just
naming the function names. But I guess that's just taste thing.

When looking for problems it's considerably more work with bytes, because - at
least for me - the large number is hard to compare quickly and to know how
aggressively we extended also requires to translate to blocks.

FWIW, I think that what we report in the error should hew as closely
to the actual system call as possible. Hence, I agree with your first
complaint and would prefer to simply see the system calls named, but I
disagree with your second complaint and would prefer to see the byte
count.

Maybe we can add errdetail("The system call was FileFallocate( ... %u ...)")
with the number of bytes, and leave the errmsg() mentioning the general
operation being done (allocate, zero, etc) with the number of blocks.

I don't see what we gain by requiring guesswork (what does allocating vs
zeroing mean, zeroing also allocates disk space after all) to interpret the
main error message. My experience is that it's often harder to get the DETAIL
than the actual error message (grepping becomes harder due to separate line,
terse verbosity is commonly used).

I think we're going too far towards not mentioning the actual problems in too
many error messages in general.

Greetings,

Andres Freund

#39Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#38)
Re: FileFallocate misbehaving on XFS

On Mon, Dec 16, 2024 at 12:52 PM Andres Freund <andres@anarazel.de> wrote:

I don't see what we gain by requiring guesswork (what does allocating vs
zeroing mean, zeroing also allocates disk space after all) to interpret the
main error message. My experience is that it's often harder to get the DETAIL
than the actual error message (grepping becomes harder due to separate line,
terse verbosity is commonly used).

I feel like the normal way that we do this is basically:

could not {name of system call} file "\%s\": %m

e.g.

could not read file \"%s\": %m

I don't know why we should do anything else in this type of case.

--
Robert Haas
EDB: http://www.enterprisedb.com

#40Michael Harris
harmic@gmail.com
In reply to: Robert Haas (#39)
1 attachment(s)
Re: FileFallocate misbehaving on XFS

Hello,

I finally managed to get the patched version installed in a production
database where the error is occurring very regularly.

Here is a sample of the output:

2024-12-19 01:08:50 CET [2533222]: LOG: mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files:
1073741376, f_ffree: 1069933796
2024-12-19 01:08:50 CET [2533222]: ERROR: could not extend file
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" by 13 blocks,
from 110869 to 110882, using FileFallocate(): No space left on device
2024-12-19 01:08:51 CET [2533246]: LOG: mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 205004945, f_bavail: 205004945 f_files:
1073741376, f_ffree: 1069933796
2024-12-19 01:08:51 CET [2533246]: ERROR: could not extend file
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" by 14 blocks,
from 110965 to 110979, using FileFallocate(): No space left on device
2024-12-19 01:08:59 CET [2531320]: LOG: mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 204980672, f_bavail: 204980672 f_files:
1073741376, f_ffree: 1069933795
2024-12-19 01:08:59 CET [2531320]: ERROR: could not extend file
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" by 14 blocks,
from 111745 to 111759, using FileFallocate(): No space left on device
2024-12-19 01:09:01 CET [2531331]: LOG: mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 204970783, f_bavail: 204970783 f_files:
1073741376, f_ffree: 1069933795
2024-12-19 01:09:01 CET [2531331]: ERROR: could not extend file
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" by 12 blocks,
from 112045 to 112057, using FileFallocate(): No space left on device

I have attached a file containing all the errors I collected. The
error is happening pretty regularly - over 400 times in a ~6 hour
period. The number of blocks being extended varies from ~9 to ~15, and
the statfs result shows plenty of available space & inodes at the
time. The errors do seem to come in bursts.

This is a different system to those I previously provided logs from.
It is also running RHEL8 with a similar configuration to the other
system.

I have so far not installed the bpftrace that Jakub suggested before -
as I say this is a production machine and I am wary of triggering a
kernel panic or worse (even though it seems like the risk for that
would be low?). While a kernel stack trace would no doubt be helpful
to the XFS developers, from a postgres point of view, would that be
likely to help us decide what to do about this?

Cheers
Mike


On Tue, 17 Dec 2024 at 10:23, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 16, 2024 at 12:52 PM Andres Freund <andres@anarazel.de> wrote:

I don't see what we gain by requiring guesswork (what does allocating vs
zeroing mean, zeroing also allocates disk space after all) to interpret the
main error message. My experience is that it's often harder to get the DETAIL
than the actual error message (grepping becomes harder due to separate line,
terse verbosity is commonly used).

I feel like the normal way that we do this is basically:

could not {name of system call} file "\%s\": %m

e.g.

could not read file \"%s\": %m

I don't know why we should do anything else in this type of case.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

rhel8_fallocate_extended.log
#41Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Michael Harris (#40)
Re: FileFallocate misbehaving on XFS

On Thu, Dec 19, 2024 at 7:49 AM Michael Harris <harmic@gmail.com> wrote:

Hello,

I finally managed to get the patched version installed in a production
database where the error is occurring very regularly.

Here is a sample of the output:

2024-12-19 01:08:50 CET [2533222]: LOG: mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files:
1073741376, f_ffree: 1069933796

[..]

I have attached a file containing all the errors I collected. The
error is happening pretty regularly - over 400 times in a ~6 hour
period. The number of blocks being extended varies from ~9 to ~15, and
the statfs result shows plenty of available space & inodes at the
time. The errors do seem to come in bursts.

I couldn't resist: you seem to have entered the quantum realm of free disk
space, AKA Schrödinger's free space: you both have the space and don't have
it... ;)

No one else has responded, so I'll try. My take is that we have had a very
limited number of reports (2-3) of this happening, and it always seems to be
at >90% space used. The adoption of PG16 is rising, so we may or may not see
more errors of this kind, but on the other hand the frequency is so low that
it's really wild we don't see more reports like this one. Lots of OS
upgrades in the wild are performed by building new standbys (maybe that
lowers the fs fragmentation) rather than by in-place OS upgrades. To me it
sounds like a new, rare bug in XFS. You can probably live with #undef
HAVE_POSIX_FALLOCATE as a way to survive; another option would be to try
running xfs_fsr to defragment the fs.

Longer-term: other than collecting the eBPF data to start digging into where
it is really triggered, I don't see a way forward. It would be suboptimal to
just abandon the fallocate() optimizations from commit
31966b151e6ab7a6284deab6e8fe5faddaf2ae4c just because of a very unusual
combination of factors (an XFS bug).

Well, we could have some kludge along the lines of the pseudo-code
if (posix_fallocate() == ENOSPC && statfs().free_space_pct >= 1)
fallback_to_pwrites(), but it is ugly (a rough sketch follows below).
Another option is a GUC (or even two -- how much to extend, and whether to
use posix_fallocate() at all), but people do not like more GUCs...
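
Just to make the kludge concrete, a very rough standalone sketch (hypothetical
helper name and a hard-coded ~1% threshold -- nothing like actual patch code):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/vfs.h>

/*
 * Very rough sketch, NOT proposed patch code: if posix_fallocate() reports
 * ENOSPC but statfs() says at least ~1% of the filesystem is still free,
 * fall back to extending the file with plain zero-filling pwrite() calls.
 * Returns 0 on success, otherwise an errno.
 */
static int
extend_with_fallback(int fd, const char *path, off_t offset, off_t len)
{
    struct statfs sf;
    char        zeros[8192] = {0};
    int         ret = posix_fallocate(fd, offset, len);

    if (ret != ENOSPC)
        return ret;

    /* trust the ENOSPC if statfs() fails or free space really is below ~1% */
    if (statfs(path, &sf) != 0 || sf.f_bavail * 100 < sf.f_blocks)
        return ret;

    while (len > 0)
    {
        size_t      chunk = len > (off_t) sizeof(zeros) ? sizeof(zeros) : (size_t) len;
        ssize_t     written = pwrite(fd, zeros, chunk, offset);

        if (written <= 0)
            return written < 0 ? errno : EIO;
        offset += written;
        len -= written;
    }
    return 0;
}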

I have so far not installed the bpftrace that Jakub suggested before -
as I say this is a production machine and I am wary of triggering a
kernel panic or worse (even though it seems like the risk for that
would be low?). While a kernel stack trace would no doubt be helpful
to the XFS developers, from a postgres point of view, would that be
likely to help us decide what to do about this?[..]

Well, you could try getting a reproduction outside of production, or even
clone the storage (not via backup/restore, but literally cloning the XFS
LUNs on the storage array itself) and attach the clones to a separate VM to
have a safe testbed (or even dd(1) a smaller XFS fs exhibiting this
behaviour to some other place).

As for eBPF/bpftrace: it is safe (it's sandboxed anyway), lots of customers
are using it, but as always YMMV.

There's also xfs_fsr, which might help overcome the fragmentation.

You can also experiment to see whether -o allocsize helps, or even just try
-o allocsize=0 (though that will probably have some negative performance
effects).

-J.

#42Andres Freund
andres@anarazel.de
In reply to: Michael Harris (#40)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-19 17:47:13 +1100, Michael Harris wrote:

I finally managed to get the patched version installed in a production
database where the error is occurring very regularly.

Thanks!

Here is a sample of the output:

2024-12-19 01:08:50 CET [2533222]: LOG: mdzeroextend FileFallocate
failing with ENOSPC: free space for filesystem containing
"pg_tblspc/107724/PG_16_202307071/465960/2591590762.15" f_blocks:
2683831808, f_bfree: 205006167, f_bavail: 205006167 f_files:
1073741376, f_ffree: 1069933796

That's ~700 GB of free space...

It'd be interesting to see filefrag -v for that segment.

This is a different system to those I previously provided logs from.
It is also running RHEL8 with a similar configuration to the other
system.

Given it's a RHEL system, have you raised this as an issue with RH? They
probably have somebody with actual XFS hacking experience on staff.

RH's kernels are *heavily* patched, so it's possible the issue is actually RH
specific.

I have so far not installed the bpftrace that Jakub suggested before -
as I say this is a production machine and I am wary of triggering a
kernel panic or worse (even though it seems like the risk for that
would be low?). While a kernel stack trace would no doubt be helpful
to the XFS developers, from a postgres point of view, would that be
likely to help us decide what to do about this?

Well, I'm personally wary of installing workarounds for a problem I don't
understand and can't reproduce, which might be specific to old filesystems
and/or heavily patched kernels. This clearly is an FS bug.

That said, if we learn that somehow this is a fundamental XFS issue that can
be triggered on every XFS filesystem, with current kernels, it becomes more
reasonable to implement a workaround in PG.

Another thing I've been wondering about is if we could reduce the frequency of
hitting problems by rounding up the number of blocks we extend by to powers of
two. That would probably reduce fragmentation, and the extra space would be
quickly used in workloads where we extend by a bunch of blocks at once,
anyway.
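
Something as trivial as the following would do the rounding (illustrative
only; IIRC pg_bitutils.h already has pg_nextpower2_32() that could be used
instead):

/* Illustrative only: round a requested extension size up to a power of two. */
static unsigned int
round_up_extend_blocks(unsigned int nblocks)
{
    unsigned int result = 1;

    while (result < nblocks)
        result <<= 1;
    return result;
}

That would collapse the 9-15 block extensions seen in the logs above into
uniform 16-block requests.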

Greetings,

Andres Freund

#43Bruce Momjian
bruce@momjian.us
In reply to: Jakub Wartak (#41)
Re: FileFallocate misbehaving on XFS

On Fri, Dec 20, 2024 at 01:25:41PM +0100, Jakub Wartak wrote:

On Thu, Dec 19, 2024 at 7:49 AM Michael Harris <harmic@gmail.com> wrote:
No one else has responded, so I'll try. My take is that we got very limited
number of reports (2-3) of this stuff happening and it always seem to be >90%
space used, yet the adoption of PG16 is rising, so we may or may not see more
errors of those kind, but on another side of things: it's frequency is so rare

My guess is that if you are seeing this with 90% full, and not lesser
values, that something is being temporarily exhausted in XFS and the
kernel errno API just doesn't allow returning enough detail to explain
what is being exhausted. A traditional Unix file system only has limits
for the inode table and free blocks, but I am sure XFS has many more
areas of possible exhaustion.

I didn't see any mention of checking the kernel log, which might have
more details of what XFS is having problems with.

I agree trying to modify Postgres for this now makes no sense since we
don't even know the cause. Once we find the cause, and admit it can't
be avoided or quickly fixed, we can reevaluate.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.

#44Andres Freund
andres@anarazel.de
In reply to: Michael Harris (#40)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-19 17:47:13 +1100, Michael Harris wrote:

I have attached a file containing all the errors I collected. The
error is happening pretty regularly - over 400 times in a ~6 hour
period. The number of blocks being extended varies from ~9 to ~15, and
the statfs result shows plenty of available space & inodes at the
time. The errors do seem to come in bursts.

One interesting moment is this:

2024-12-19 01:59:39 CET [2559130]: LOG: mdzeroextend FileFallocate failing with ENOSPC: free space for filesystem containing "pg_tblspc/107724/PG_16_202307071/465960/3232056651" f_blocks: 2683831808, f_bfree: 198915036, f_bavail: 198915036 f_files: 1073741376, f_ffree: 1069932412
2024-12-19 01:59:39 CET [2559130]: ERROR: could not extend file "pg_tblspc/107724/PG_16_202307071/465960/3232056651" by 9 blocks, from 59723 to 59732, using FileFallocate(): No space left on device
2024-12-19 04:47:04 CET [2646363]: LOG: mdzeroextend FileFallocate failing with ENOSPC: free space for filesystem containing "pg_tblspc/107724/PG_16_202307071/465960/3232056651.2" f_blocks: 2683831808, f_bfree: 300862306, f_bavail: 300862306 f_files: 1073741376, f_ffree: 1069821450
2024-12-19 04:47:04 CET [2646363]: ERROR: could not extend file "pg_tblspc/107724/PG_16_202307071/465960/3232056651.2" by 11 blocks, from 29850 to 29861, using FileFallocate(): No space left on device

Note that
a) there are a few hours between these messages, whereas previously they were more frequent
b) f_bfree increased substantially.

I assume that somewhere around 2AM some script prunes old partitions?
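
For reference, a minimal sketch of how diagnostics like the LOG lines above
could be gathered (a hypothetical wrapper, not the actual instrumentation used
on these systems): on ENOSPC from posix_fallocate(), capture statvfs() for the
same path so the error can be correlated with the filesystem's view of free
space and inodes.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/statvfs.h>
#include <sys/types.h>

/* Hypothetical wrapper: log statvfs() counters when fallocate hits ENOSPC. */
static int
fallocate_with_diagnostics(int fd, const char *path, off_t offset, off_t len)
{
    int rc = posix_fallocate(fd, offset, len);  /* returns an errno value, not -1 */

    if (rc == ENOSPC)
    {
        struct statvfs sv;

        if (statvfs(path, &sv) == 0)
            fprintf(stderr,
                    "fallocate ENOSPC on \"%s\": f_blocks=%llu f_bfree=%llu "
                    "f_bavail=%llu f_files=%llu f_ffree=%llu\n",
                    path,
                    (unsigned long long) sv.f_blocks,
                    (unsigned long long) sv.f_bfree,
                    (unsigned long long) sv.f_bavail,
                    (unsigned long long) sv.f_files,
                    (unsigned long long) sv.f_ffree);
    }

    return rc;
}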

Greetings,

Andres Freund

#45Andrea Gelmini
andrea.gelmini@gmail.com
In reply to: Andres Freund (#44)
Re: FileFallocate misbehaving on XFS

On Tue, 31 Dec 2024 at 16:31, Andres Freund <andres@anarazel.de> wrote:

2024-12-19 04:47:04 CET [2646363]: ERROR: could not extend file

"pg_tblspc/107724/PG_16_202307071/465960/3232056651.2" by 11 blocks, from
29850 to 29861, using FileFallocate(): No space left on device

Dunno if it helps, but today I read this reference in the latest patchset on
the XFS mailing list:
https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangshida@kylinos.cn/

Could it be related and explain the effect?

For the record, I found it here:
https://lore.kernel.org/linux-xfs/CANubcdXWHOtTW4PjJE1qjAJHEg48LS7MFc065gcQwoH7s0Ybqw@mail.gmail.com/

Ciao,
Gelma

#46Andres Freund
andres@anarazel.de
In reply to: Andrea Gelmini (#45)
Re: FileFallocate misbehaving on XFS

Hi,

On 2025-01-02 11:41:56 +0100, Andrea Gelmini wrote:

On Tue, 31 Dec 2024 at 16:31, Andres Freund <andres@anarazel.de> wrote:

2024-12-19 04:47:04 CET [2646363]: ERROR: could not extend file

"pg_tblspc/107724/PG_16_202307071/465960/3232056651.2" by 11 blocks, from
29850 to 29861, using FileFallocate(): No space left on device

Dunno if it helps, but today I read this reference in the latest patchset on
the XFS mailing list:
https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangshida@kylinos.cn/

Could it be related and explain the effect?

I doubt it - there was a lot more free space in the various AGs (allocation
groups), and with the exception of 1-2 AGs they weren't that fragmented.

Greetings,

Andres Freund

#47Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#42)
Re: FileFallocate misbehaving on XFS

Hi,

On 2024-12-20 11:39:42 -0500, Andres Freund wrote:

On 2024-12-19 17:47:13 +1100, Michael Harris wrote:

This is a different system to those I previously provided logs from.
It is also running RHEL8 with a similar configuration to the other
system.

Given it's a RHEL system, have you raised this as an issue with RH? They
probably have somebody with actual XFS hacking experience on staff.

RH's kernels are *heavily* patched, so it's possible the issue is actually RH
specific.

FWIW, I raised this on the #xfs IRC channel. One request they had was:

│15:56:40 dchinner | andres: can you get a metadump of a filesystem that is displaying these symptoms for us to analyse?
│15:57:54 dchinner | metadumps don't contain data, and metadata is obfuscated so no filenames or attributes are exposed, either.

Greetings,

Andres

#48Michael Harris
harmic@gmail.com
In reply to: Andres Freund (#47)
Re: FileFallocate misbehaving on XFS

Hi Andres

On Wed, 1 Jan 2025 at 02:31, Andres Freund <andres@anarazel.de> wrote:

Note that there's
a) a few hours between messages, whereas previous they were more frequent
b) f_bfree increased substantially.

I assume that somewhere around 2AM some script prunes old partitions?

Correct.
Data is imported continuously, and the pruning is typically scheduled for 02:00.
In general the events do seem more frequent when the FS is closer to full.

On Fri, 3 Jan 2025 at 01:40, Andres Freund <andres@anarazel.de> wrote:

FWIW, I raised this on the #xfs irc channel. One ask they had was:

│15:56:40 dchinner | andres: can you get a metadump of a filesystem that is displaying these symptoms for us to analyse?
│15:57:54 dchinner | metadumps don't contain data, and metadata is obfuscated so no filenames or attributes are exposed, either.

I'm on leave from work at the moment, but I will try to collect this
when I'm back.

Cheers
Mike

#49Michael Harris
harmic@gmail.com
In reply to: Michael Harris (#48)
Re: FileFallocate misbehaving on XFS

Hello All

An update on this.

Earlier in this thread, Jakub had suggested remounting the XFS
filesystems with the mount option allocsize=1m.
I've done that now, on a few systems that have been experiencing this
error multiple times a day, and it does seem to stop the errors from
occurring.
It has only been a few days, but it does look encouraging as a workaround.

From the xfs man page, it seems that manually setting allocsize turns
off the dynamic preallocation size heuristics that are normally used
by XFS.

The other piece of info is that we have raised a support ticket with
Redhat. I'm not directly in contact with them (it's another team that
handles that) but I will let you know of any developments.

Cheers
Mike

#50Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Michael Harris (#49)
Re: FileFallocate misbehaving on XFS

Hello Mike,

We encountered the same problem with a fixed allocsize=262144k. Removing
this option seemed to fix the problem; we are now relying on XFS's own
allocation heuristics. The problem has not shown up since the change was
made on Monday, but I'm not totally sure this has fixed it.
If we can provide you with more information, please let me know.

On Fri, 31 Jan 2025 at 00:53, Michael Harris <harmic@gmail.com> wrote:

Hello All

An update on this.

Earlier in this thread, Jakub had suggested remounting the XFS
filesystems with the mount option allocsize=1m.
I've done that now, on a few systems that have been experiencing this
error multiple times a day, and it does seem to stop the errors from
occurring.
It has only been a few days, but it does look encouraging as a workaround.

From the xfs man page, it seems that manually setting allocsize turns
off the dynamic preallocation size heuristics that are normally used
by XFS.

The other piece of info is that we have raised a support ticket with
Redhat. I'm not directly in contact with them (it's another team that
handles that) but I will let you know of any developments.

Cheers
Mike

--
Jean-Christophe Arnu

#51Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Jean-Christophe Arnu (#50)
Re: FileFallocate misbehaving on XFS

On Fri, Jan 31, 2025 at 3:33 PM Jean-Christophe Arnu <jcarnu@gmail.com> wrote:

Hello Mike,

We encountered the same problem with a fixed allocsize=262144k. Removing this option seemed to fix the problem; we are now relying on XFS's own allocation heuristics. The problem has not shown up since the change was made on Monday, but I'm not totally sure this has fixed it.
If we can provide you with more information, please let me know.

On Fri, 31 Jan 2025 at 00:53, Michael Harris <harmic@gmail.com> wrote:

Hello All

An update on this.

Earlier in this thread, Jakub had suggested remounting the XFS
filesystems with the mount option allocsize=1m.
I've done that now, on a few systems that have been experiencing this
error multiple times a day, and it does seem to stop the errors from
occurring.
It has only been a few days, but it does look encouraging as a workaround.

From the xfs man page, it seems that manually setting allocsize turns
off the dynamic preallocation size heuristics that are normally used
by XFS.

The other piece of info is that we have raised a support ticket with
Redhat. I'm not directly in contact with them (it's another team that
handles that) but I will let you know of any developments.

Hi Mike and Jean-Christophe,

Out of curiosity, did Redhat provide any further (deeper) information
on why too-large preallocations (or which heuristics) on XFS can cause
such issues?

-J

#52Michael Harris
harmic@gmail.com
In reply to: Jakub Wartak (#51)
Re: FileFallocate misbehaving on XFS

On Tue, 4 Mar 2025 at 23:05, Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:

Out of curiosity, did Redhat provide any further (deeper) information
on why too-large preallocations (or which heuristics) on XFS can cause
such issues?

Hi Jakub

So far I don't have any feedback from RH. Unfortunately there is a
long corporate chain of people between me and whoever in RH is
dealing with the ticket! I will update this list if/when I get an
update.

Cheers
Mike

#53Andrea Gelmini
andrea.gelmini@gmail.com
In reply to: Andrea Gelmini (#2)
Re: FileFallocate misbehaving on XFS

On Mon, 9 Dec 2024 at 10:47, Andrea Gelmini <andrea.gelmini@gmail.com> wrote:

Funny, i guess it's the same reason I see randomly complain of WhatsApp
web interface, on Chrome, since I switched to XFS. It says something like
"no more space on disk" and logout, with more than 300GB available.

To be fair, after months (changing filesystems and some googling), I found
it's not XFS related, and it happens with different browsers too. So my
suspicion was wrong.

Ciao,
Gelma

#54Pierre Barre
pierre@barre.sh
In reply to: Michael Harris (#1)
Re: FileFallocate misbehaving on XFS

Hello,

I was running into the same thing, and the fix for me was mounting XFS with -o inode64

Best,
Pierre


On Mon, Dec 9, 2024, at 08:34, Michael Harris wrote:

Hello PG Hackers

Our application has recently migrated to PG16, and we have experienced
some failed upgrades. The upgrades are performed using pg_upgrade and
have failed during the phase where the schema is restored into the new
cluster, with the following error:

pg_restore: error: could not execute query: ERROR: could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device
HINT: Check free disk space.

This has happened multiple times on different servers, and in each
case there was plenty of free space available.

We found this thread describing similar issues:

/messages/by-id/AS1PR05MB91059AC8B525910A5FCD6E699F9A2@AS1PR05MB9105.eurprd05.prod.outlook.com

As is the case in that thread, all of the affected databases are using XFS.

One of my colleagues built postgres from source with
HAVE_POSIX_FALLOCATE not defined, and using that build he was able to
complete the pg_upgrade, and then switched to a stock postgres build
after the upgrade. However, as you might expect, after the upgrade we
have experienced similar errors during regular operation. We make
heavy use of COPY, which is mentioned in the above discussion as
pre-allocating files.

We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
Linux 9 (Kernel 5.14.0).

I am wondering if this bug might be related:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323

When given an offset of 0 and a length, fallocate (man 2 fallocate) reports ENOSPC if the size of the file + the length to be allocated is greater than the available space.

There is a reproduction procedure at the bottom of the above ubuntu
thread, and using that procedure I get the same results on both kernel
4.18.0 and 5.14.0.
When calling fallocate with offset zero on an existing file, I get
enospc even if I am only requesting the same amount of space as the
file already has.
If I repeat the experiment with ext4 I don't get that behaviour.

On a surface examination of the code paths leading to the
FileFallocate call, it does not look like it should be trying to
allocate already allocated space, but I might have missed something
there.

Is this already being looked into?

Thanks in advance,

Cheers
Mike