macOS prefetching support
It seems to me that we could implement prefetching support
(USE_PREFETCH) on macOS using the fcntl() command F_RDADVISE. The man
page description is a bit terse:
F_RDADVISE Issue an advisory read async with no copy to user.
But it seems to be the right idea. Was this looked into before? I
couldn't find anything in the archives.
Attached is a patch to implement this. It seems to work, but of course
it's kind of hard to tell whether it actually does anything useful.
(Even if the performance effects were negligible, this would be useful
to get the prefetch code some exercise on this platform.)
Thoughts?
Attachments:
0001-Add-prefetching-support-on-macOS.patchtext/plain; charset=UTF-8; name=0001-Add-prefetching-support-on-macOS.patchDownload
From d69804b540d0bf168f0706af289ea3583353a30c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Wed, 14 Aug 2024 08:50:15 +0200
Subject: [PATCH] Add prefetching support on macOS
macOS doesn't have posix_fadvise(), but fcntl() with the F_RDADVISE
command does the same thing.
TODO: docs update
---
src/backend/commands/variable.c | 4 +--
src/backend/storage/file/fd.c | 52 +++++++++++++++++++++++----------
src/include/pg_config_manual.h | 7 ++---
3 files changed, 41 insertions(+), 22 deletions(-)
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 6202c5ebe44..c1c6c2811c9 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1212,7 +1212,7 @@ check_effective_io_concurrency(int *newval, void **extra, GucSource source)
#ifndef USE_PREFETCH
if (*newval != 0)
{
- GUC_check_errdetail("\"effective_io_concurrency\" must be set to 0 on platforms that lack posix_fadvise().");
+ GUC_check_errdetail("\"effective_io_concurrency\" must be set to 0 on platforms that lack prefetching support.");
return false;
}
#endif /* USE_PREFETCH */
@@ -1225,7 +1225,7 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
#ifndef USE_PREFETCH
if (*newval != 0)
{
- GUC_check_errdetail("\"maintenance_io_concurrency\" must be set to 0 on platforms that lack posix_fadvise().");
+ GUC_check_errdetail("\"maintenance_io_concurrency\" must be set to 0 on platforms that lack prefetching support.");
return false;
}
#endif /* USE_PREFETCH */
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 3944321ff37..8fe3ac58d97 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2068,7 +2068,9 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
*
- * Currently the only implementation of this function is using posix_fadvise
+ * Returns 0 on success, otherwise an errno error code.
+ *
+ * XXX Currently the only implementation of this function is using posix_fadvise
* which is the simplest standardized interface that accomplishes this.
* We could add an implementation using libaio in the future; but note that
* this API is inappropriate for libaio, which wants to have a buffer provided
@@ -2077,31 +2079,51 @@ FileClose(File file)
int
FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
{
-#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
- int returnCode;
-
Assert(FileIsValid(file));
DO_DB(elog(LOG, "FilePrefetch: %d (%s) " INT64_FORMAT " " INT64_FORMAT,
file, VfdCache[file].fileName,
(int64) offset, (int64) amount));
- returnCode = FileAccess(file);
- if (returnCode < 0)
- return returnCode;
+#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
+ {
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
retry:
- pgstat_report_wait_start(wait_event_info);
- returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
- POSIX_FADV_WILLNEED);
- pgstat_report_wait_end();
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
+ POSIX_FADV_WILLNEED);
+ pgstat_report_wait_end();
- if (returnCode == EINTR)
- goto retry;
+ if (returnCode == EINTR)
+ goto retry;
- return returnCode;
+ return returnCode;
+ }
+#elif defined(__darwin__)
+ {
+ struct radvisory
+ {
+ off_t ra_offset; /* offset into the file */
+ int ra_count; /* size of the read */
+ } ra;
+ int returnCode;
+
+ ra.ra_offset = offset;
+ ra.ra_count = amount;
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = fcntl(VfdCache[file].fd, F_RDADVISE, &ra);
+ pgstat_report_wait_end();
+ if (returnCode != -1)
+ return 0;
+ else
+ return errno;
+ }
#else
- Assert(FileIsValid(file));
return 0;
#endif
}
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index b769030d8fa..84bb5368b3d 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -138,12 +138,9 @@
/*
* USE_PREFETCH code should be compiled only if we have a way to implement
- * prefetching. (This is decoupled from USE_POSIX_FADVISE because there
- * might in future be support for alternative low-level prefetch APIs.
- * If you change this, you probably need to adjust the error message in
- * check_effective_io_concurrency.)
+ * prefetching.
*/
-#ifdef USE_POSIX_FADVISE
+#if defined(USE_POSIX_FADVISE) || defined(__darwin__)
#define USE_PREFETCH
#endif
--
2.46.0
On Wed, Aug 14, 2024 at 7:04 PM Peter Eisentraut <peter@eisentraut.org> wrote:
Attached is a patch to implement this. It seems to work, but of course
it's kind of hard to tell whether it actually does anything useful.
Header order problem: pg_config_os.h defines __darwin__, but
pg_config_manual.h is included first, and tests __darwin__. I hacked
my way around that, and then made a table of 40,000,000 integers in a
2GB buffer pool. I used "select count(pg_buffercache_evict(buffered))
from pg_buffer_cache", and "sudo purge", to clear the two layers of
cache for each test, and then measured:
maintenance_io_concurrency=0, ANALYZE: 2311ms
maintenance_io_concurrency=10, ANALYZE: 652ms
maintenance_io_concurrency=25, ANALYZE: 389ms
It works!
On 14.08.24 14:36, Thomas Munro wrote:
On Wed, Aug 14, 2024 at 7:04 PM Peter Eisentraut <peter@eisentraut.org> wrote:
Attached is a patch to implement this. It seems to work, but of course
it's kind of hard to tell whether it actually does anything useful.Header order problem: pg_config_os.h defines __darwin__, but
pg_config_manual.h is included first, and tests __darwin__. I hacked
my way around that, and then made a table of 40,000,000 integers in a
2GB buffer pool. I used "select count(pg_buffercache_evict(buffered))
from pg_buffer_cache", and "sudo purge", to clear the two layers of
cache for each test, and then measured:maintenance_io_concurrency=0, ANALYZE: 2311ms
maintenance_io_concurrency=10, ANALYZE: 652ms
maintenance_io_concurrency=25, ANALYZE: 389msIt works!
Cool! I'll work on a more polished patch.
On 14.08.24 16:39, Peter Eisentraut wrote:
On 14.08.24 14:36, Thomas Munro wrote:
On Wed, Aug 14, 2024 at 7:04 PM Peter Eisentraut
<peter@eisentraut.org> wrote:Attached is a patch to implement this. It seems to work, but of course
it's kind of hard to tell whether it actually does anything useful.Header order problem: pg_config_os.h defines __darwin__, but
pg_config_manual.h is included first, and tests __darwin__. I hacked
my way around that, and then made a table of 40,000,000 integers in a
2GB buffer pool. I used "select count(pg_buffercache_evict(buffered))
from pg_buffer_cache", and "sudo purge", to clear the two layers of
cache for each test, and then measured:maintenance_io_concurrency=0, ANALYZE: 2311ms
maintenance_io_concurrency=10, ANALYZE: 652ms
maintenance_io_concurrency=25, ANALYZE: 389msIt works!
Cool! I'll work on a more polished patch.
Here it is.
Some interesting questions:
What to do about the order of the symbols and include files. I threw
something into src/include/port/darwin.h, but I'm not sure if that's
good. Alternatively, we could not use __darwin__ but instead the more
standard and predefined defined(__APPLE__) && defined(__MACH__).
How to document it. The current documentation makes references mainly
to the availability of posix_fadvise(). That seems quite low-level.
How could a user of a prepared package even find out about that? Should
we just say "requires OS support" (kind of like I did here) and you can
query the effective state by looking at the *_io_concurrency settings?
Or do we need a read-only parameter that shows whether prefetch support
exists (kind of along the lines of huge pages)?
Btw., for context, here is what I gather the prefetch support (with this
patch) is:
cygwin posix_fadvise
darwin fcntl
freebsd posix_fadvise
linux posix_fadvise
netbsd posix_fadvise
openbsd no
solaris fake
win32 no
(There is also the possibility that we provide an implementation of
posix_fadvise() for macOS that wraps the platform-specific code in this
patch. And then Apple could just take that. ;-) )
Attachments:
v2-0001-Add-prefetching-support-on-macOS.patchtext/plain; charset=UTF-8; name=v2-0001-Add-prefetching-support-on-macOS.patchDownload
From d318c4108b9e05a4828f6e7f71af34c3ca89b3ed Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Fri, 16 Aug 2024 20:15:43 +0200
Subject: [PATCH v2] Add prefetching support on macOS
macOS doesn't have posix_fadvise(), but fcntl() with the F_RDADVISE
command does the same thing.
Discussion: https://www.postgresql.org/message-id/flat/0827edec-1317-4917-a186-035eb1e3241d%40eisentraut.org
---
doc/src/sgml/config.sgml | 14 +++-----
doc/src/sgml/wal.sgml | 4 +--
src/backend/commands/variable.c | 4 +--
src/backend/storage/file/fd.c | 59 ++++++++++++++++++++++-----------
src/include/pg_config_manual.h | 7 ++--
src/include/port/darwin.h | 5 +++
6 files changed, 57 insertions(+), 36 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2937384b001..c6d2fa2148e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2679,11 +2679,10 @@ <title>Asynchronous Behavior</title>
</para>
<para>
- Asynchronous I/O depends on an effective <function>posix_fadvise</function>
- function, which some operating systems lack. If the function is not
- present then setting this parameter to anything but zero will result
- in an error. On some operating systems (e.g., Solaris), the function
- is present but does not actually do anything.
+ Asynchronous I/O depends on an effective support by the operating
+ system, which some operating systems lack. If there is no operating
+ system support then setting this parameter to anything but zero will
+ result in an error.
</para>
<para>
@@ -3852,10 +3851,7 @@ <title>Recovery</title>
<literal>off</literal>, <literal>on</literal> and
<literal>try</literal> (the default). The setting
<literal>try</literal> enables
- prefetching only if the operating system provides the
- <function>posix_fadvise</function> function, which is currently used
- to implement prefetching. Note that some operating systems provide the
- function, but it doesn't do anything.
+ prefetching only if the operating system provides prefetching support.
</para>
<para>
Prefetching blocks that will soon be needed can reduce I/O wait times
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d5df65bc693..72b73dbf113 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -841,8 +841,8 @@ <title><acronym>WAL</acronym> Configuration</title>
The <xref linkend="guc-maintenance-io-concurrency"/> and
<xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
concurrency and distance, respectively. By default, it is set to
- <literal>try</literal>, which enables the feature on systems where
- <function>posix_fadvise</function> is available.
+ <literal>try</literal>, which enables the feature on systems that have
+ prefetching support.
</para>
</sect1>
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 6202c5ebe44..c1c6c2811c9 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1212,7 +1212,7 @@ check_effective_io_concurrency(int *newval, void **extra, GucSource source)
#ifndef USE_PREFETCH
if (*newval != 0)
{
- GUC_check_errdetail("\"effective_io_concurrency\" must be set to 0 on platforms that lack posix_fadvise().");
+ GUC_check_errdetail("\"effective_io_concurrency\" must be set to 0 on platforms that lack prefetching support.");
return false;
}
#endif /* USE_PREFETCH */
@@ -1225,7 +1225,7 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
#ifndef USE_PREFETCH
if (*newval != 0)
{
- GUC_check_errdetail("\"maintenance_io_concurrency\" must be set to 0 on platforms that lack posix_fadvise().");
+ GUC_check_errdetail("\"maintenance_io_concurrency\" must be set to 0 on platforms that lack prefetching support.");
return false;
}
#endif /* USE_PREFETCH */
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 3944321ff37..2830b310e0b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2068,40 +2068,61 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
*
- * Currently the only implementation of this function is using posix_fadvise
- * which is the simplest standardized interface that accomplishes this.
- * We could add an implementation using libaio in the future; but note that
- * this API is inappropriate for libaio, which wants to have a buffer provided
- * to read into.
+ * Returns 0 on success, otherwise an errno error code (like posix_fadvise()).
+ *
+ * posix_fadvise() is the simplest standardized interface that accomplishes
+ * this. We could add an implementation using libaio in the future; but note
+ * that this API is inappropriate for libaio, which wants to have a buffer
+ * provided to read into.
*/
int
FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
{
-#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
- int returnCode;
-
Assert(FileIsValid(file));
DO_DB(elog(LOG, "FilePrefetch: %d (%s) " INT64_FORMAT " " INT64_FORMAT,
file, VfdCache[file].fileName,
(int64) offset, (int64) amount));
- returnCode = FileAccess(file);
- if (returnCode < 0)
- return returnCode;
+#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
+ {
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
retry:
- pgstat_report_wait_start(wait_event_info);
- returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
- POSIX_FADV_WILLNEED);
- pgstat_report_wait_end();
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
+ POSIX_FADV_WILLNEED);
+ pgstat_report_wait_end();
- if (returnCode == EINTR)
- goto retry;
+ if (returnCode == EINTR)
+ goto retry;
- return returnCode;
+ return returnCode;
+ }
+#elif defined(__darwin__)
+ {
+ struct radvisory
+ {
+ off_t ra_offset; /* offset into the file */
+ int ra_count; /* size of the read */
+ } ra;
+ int returnCode;
+
+ ra.ra_offset = offset;
+ ra.ra_count = amount;
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = fcntl(VfdCache[file].fd, F_RDADVISE, &ra);
+ pgstat_report_wait_end();
+ if (returnCode != -1)
+ return 0;
+ else
+ return errno;
+ }
#else
- Assert(FileIsValid(file));
return 0;
#endif
}
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e799c2989b8..d603b87afd3 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -139,11 +139,10 @@
/*
* USE_PREFETCH code should be compiled only if we have a way to implement
* prefetching. (This is decoupled from USE_POSIX_FADVISE because there
- * might in future be support for alternative low-level prefetch APIs.
- * If you change this, you probably need to adjust the error message in
- * check_effective_io_concurrency.)
+ * might in future be support for alternative low-level prefetch APIs,
+ * as well as platform-specific APIs defined elsewhere.)
*/
-#ifdef USE_POSIX_FADVISE
+#if defined(USE_POSIX_FADVISE)
#define USE_PREFETCH
#endif
diff --git a/src/include/port/darwin.h b/src/include/port/darwin.h
index 15fb69d6dbb..6aa2ea70f6b 100644
--- a/src/include/port/darwin.h
+++ b/src/include/port/darwin.h
@@ -6,3 +6,8 @@
#define HAVE_FSYNC_WRITETHROUGH
#endif
+
+/*
+ * macOS has a platform-specific implementation of prefetching.
+ */
+#define USE_PREFETCH
base-commit: e3ec9dc1bf4983fcedb6f43c71ea12ee26aefc7a
--
2.46.0
On Sat, Aug 17, 2024 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote:
What to do about the order of the symbols and include files. I threw
something into src/include/port/darwin.h, but I'm not sure if that's
good. Alternatively, we could not use __darwin__ but instead the more
standard and predefined defined(__APPLE__) && defined(__MACH__).
Hmm. fd.h and fd.c test for F_NOCACHE, which is pretty closely
related. Now I'm wondering why we actually need this in
pg_config_manual.h at all. Who would turn it off at compile time, and
why would they not be satisfied with setting relevant GUCs to 0? Can
we just teach fd.h to define USE_PREFETCH if
defined(POSIX_FADV_WILLNEED) || defined(F_RDADVISE)?
(I have also thought multiple times about removing the configure
probes for F_FULLFSYNC, and just doing #ifdef. Oh, that's in my patch
for CF #4453.)
How to document it. The current documentation makes references mainly
to the availability of posix_fadvise(). That seems quite low-level.
How could a user of a prepared package even find out about that? Should
we just say "requires OS support" (kind of like I did here) and you can
query the effective state by looking at the *_io_concurrency settings?
Or do we need a read-only parameter that shows whether prefetch support
exists (kind of along the lines of huge pages)?
I think that's fine. I don't really like the word "prefetch", could
mean many different things. What about "requires OS support for
issuing read-ahead advice", which uses a word that appears in both of
those interfaces? And yeah the GUCs being nailed to zero means you
don't have it.
(There is also the possibility that we provide an implementation of
posix_fadvise() for macOS that wraps the platform-specific code in this
patch. And then Apple could just take that. ;-) )
Yeah might be simpler... as long as we make sure that we test for
having the function AND the relevant POSIX_FADV_xxx in places, which I
see that we do.
Thomas Munro <thomas.munro@gmail.com> writes:
Hmm. fd.h and fd.c test for F_NOCACHE, which is pretty closely
related. Now I'm wondering why we actually need this in
pg_config_manual.h at all. Who would turn it off at compile time, and
why would they not be satisfied with setting relevant GUCs to 0?
+1 for not bothering with a pg_config_manual.h knob.
regards, tom lane
On Sat, Aug 17, 2024 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote:
solaris fake
I'm half tempted to suggest that we take this exception out. If it's
there, we call it. It doesn't do anything right now, but it's a cheap
empty user space function, and I heard they are thinking about adding
a real syscall. (In a burst of enthusiasm for standards and stuff, I
talked to people involved in all the OSes using ZFS, to suggest
hooking that up; they added some other stuff I asked about so I think
it's a real threat.) But I haven't noticed any users on this list in
many years, to opine either way.
It doesn't do anything on Cygwin either. Actually, it's worse than
nothing, it looks like it makes two useless system calls adjusting the
"sequential" flag on the file handle. But really, of the 3 ways to
build on Windows, only MSVC has real users, so it makes no useless
system calls at all, so I don't propose to change anything due to this
observation.
win32 no
I guess you could implement this in two ways:
* CreateFileMapping(), MapViewOfFile(), PrefetchVirtualMemory(),
UnmapViewOfFile(), Closehandle(). That's a lot system calls and maybe
expensive VM stuff, who knows.
* ReadFileEx() in OVERLAPPED mode (async), into a dummy buffer, with a
completion callback that frees the buffer and OVERLAPPED object.
That's a lot of extra work allocating memory, copying to user space,
scheduling a callback, and freeing memory for nothing.
Both are terrible, but likely still better than an I/O stall, I dunno.
I think the VMS, NT, Unix-hater view would be: why would you want such
a stupid programming interface anyway, when you could use async I/O
properly? I love Unix, but they'd have a point.
On 17.08.24 00:01, Thomas Munro wrote:
On Sat, Aug 17, 2024 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote:
What to do about the order of the symbols and include files. I threw
something into src/include/port/darwin.h, but I'm not sure if that's
good. Alternatively, we could not use __darwin__ but instead the more
standard and predefined defined(__APPLE__) && defined(__MACH__).Hmm. fd.h and fd.c test for F_NOCACHE, which is pretty closely
related. Now I'm wondering why we actually need this in
pg_config_manual.h at all. Who would turn it off at compile time, and
why would they not be satisfied with setting relevant GUCs to 0? Can
we just teach fd.h to define USE_PREFETCH if
defined(POSIX_FADV_WILLNEED) || defined(F_RDADVISE)?
I thought USE_PREFETCH existed so that we don't have the run-time
overhead for all the bookkeeping code if we don't have any OS-level
prefetch support at the end. But it looks like most of that bookkeeping
code is skipped anyway if the *_io_concurrency settings are at 0. So
yes, getting rid of USE_PREFETCH globally would be useful.
(I have also thought multiple times about removing the configure
probes for F_FULLFSYNC, and just doing #ifdef. Oh, that's in my patch
for CF #4453.)
Understandable, but we should be careful here that we don't create
setups that can cause bugs like
</messages/by-id/48da4a1f-ccd9-4988-9622-24f37b1de2b4@eisentraut.org>.
I think that's fine. I don't really like the word "prefetch", could
mean many different things. What about "requires OS support for
issuing read-ahead advice", which uses a word that appears in both of
those interfaces?
I like that term.
On 18.08.24 15:35, Peter Eisentraut wrote:
On 17.08.24 00:01, Thomas Munro wrote:
Hmm. fd.h and fd.c test for F_NOCACHE, which is pretty closely
related. Now I'm wondering why we actually need this in
pg_config_manual.h at all. Who would turn it off at compile time, and
why would they not be satisfied with setting relevant GUCs to 0? Can
we just teach fd.h to define USE_PREFETCH if
defined(POSIX_FADV_WILLNEED) || defined(F_RDADVISE)?I thought USE_PREFETCH existed so that we don't have the run-time
overhead for all the bookkeeping code if we don't have any OS-level
prefetch support at the end. But it looks like most of that bookkeeping
code is skipped anyway if the *_io_concurrency settings are at 0. So
yes, getting rid of USE_PREFETCH globally would be useful.(I have also thought multiple times about removing the configure
probes for F_FULLFSYNC, and just doing #ifdef. Oh, that's in my patch
for CF #4453.)Understandable, but we should be careful here that we don't create
setups that can cause bugs like
</messages/by-id/48da4a1f-ccd9-4988-9622-24f37b1de2b4@eisentraut.org>.I think that's fine. I don't really like the word "prefetch", could
mean many different things. What about "requires OS support for
issuing read-ahead advice", which uses a word that appears in both of
those interfaces?I like that term.
Here is another patch, with the documentation wording adjusted like
this, and a bit of other tidying.
I opted against pursuing some of the other refactoring ideas as part of
this. Removing USE_PREFETCH seems worthwhile, but has some open
questions. For example, pg_prewarm has a prefetch mode, which currently
errors if there is no prefetch support. So we'd still require some
knowledge outside of fd.c, unless we want to change that behavior. The
idea of creating a src/port/posix_fadvise.c also got a bit too
complicated in terms of how to weave that into configure and meson.
There are apparently various old problems, like old Linux systems with
incompatible declarations, or something like that, and then the special
casing of Solaris (which isn't even in meson.build). If we could get
rid of some of that, then it would be easier to add new behavior there,
otherwise it's just likely to break things. So I'm leaving these as
separate projects.
In terms of $subject, this patch seems sufficient for now.
Attachments:
v3-0001-Add-prefetching-support-on-macOS.patchtext/plain; charset=UTF-8; name=v3-0001-Add-prefetching-support-on-macOS.patchDownload
From 60e86ec3b41ad0aef7e9aba4cfc0688604702047 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Fri, 23 Aug 2024 13:49:55 +0200
Subject: [PATCH v3] Add prefetching support on macOS
macOS doesn't have posix_fadvise(), but fcntl() with the F_RDADVISE
command does the same thing.
Some related documentation has been generalized to not mention
posix_advise() specifically anymore.
Discussion: https://www.postgresql.org/message-id/flat/0827edec-1317-4917-a186-035eb1e3241d%40eisentraut.org
---
doc/src/sgml/config.sgml | 14 +++-----
doc/src/sgml/wal.sgml | 4 +--
src/backend/commands/variable.c | 4 +--
src/backend/storage/file/fd.c | 59 ++++++++++++++++++++++-----------
src/include/pg_config_manual.h | 5 ++-
src/include/port/darwin.h | 5 +++
6 files changed, 56 insertions(+), 35 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2937384b001..12feac60874 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2679,11 +2679,9 @@ <title>Asynchronous Behavior</title>
</para>
<para>
- Asynchronous I/O depends on an effective <function>posix_fadvise</function>
- function, which some operating systems lack. If the function is not
- present then setting this parameter to anything but zero will result
- in an error. On some operating systems (e.g., Solaris), the function
- is present but does not actually do anything.
+ Asynchronous I/O requires that the operating system supports issuing
+ read-ahead advice. If there is no operating system support then
+ setting this parameter to anything but zero will result in an error.
</para>
<para>
@@ -3852,10 +3850,8 @@ <title>Recovery</title>
<literal>off</literal>, <literal>on</literal> and
<literal>try</literal> (the default). The setting
<literal>try</literal> enables
- prefetching only if the operating system provides the
- <function>posix_fadvise</function> function, which is currently used
- to implement prefetching. Note that some operating systems provide the
- function, but it doesn't do anything.
+ prefetching only if the operating system provides support for issuing
+ read-ahead advice.
</para>
<para>
Prefetching blocks that will soon be needed can reduce I/O wait times
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d5df65bc693..0ba0c930b78 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -841,8 +841,8 @@ <title><acronym>WAL</acronym> Configuration</title>
The <xref linkend="guc-maintenance-io-concurrency"/> and
<xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
concurrency and distance, respectively. By default, it is set to
- <literal>try</literal>, which enables the feature on systems where
- <function>posix_fadvise</function> is available.
+ <literal>try</literal>, which enables the feature on systems that support
+ issuing read-ahead advice.
</para>
</sect1>
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 6202c5ebe44..136c584305a 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1212,7 +1212,7 @@ check_effective_io_concurrency(int *newval, void **extra, GucSource source)
#ifndef USE_PREFETCH
if (*newval != 0)
{
- GUC_check_errdetail("\"effective_io_concurrency\" must be set to 0 on platforms that lack posix_fadvise().");
+ GUC_check_errdetail("\"effective_io_concurrency\" must be set to 0 on platforms that lack support for issuing read-ahead advice.");
return false;
}
#endif /* USE_PREFETCH */
@@ -1225,7 +1225,7 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
#ifndef USE_PREFETCH
if (*newval != 0)
{
- GUC_check_errdetail("\"maintenance_io_concurrency\" must be set to 0 on platforms that lack posix_fadvise().");
+ GUC_check_errdetail("\"maintenance_io_concurrency\" must be set to 0 on platforms that lack support for issuing read-ahead advice.");
return false;
}
#endif /* USE_PREFETCH */
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 3944321ff37..2830b310e0b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2068,40 +2068,61 @@ FileClose(File file)
/*
* FilePrefetch - initiate asynchronous read of a given range of the file.
*
- * Currently the only implementation of this function is using posix_fadvise
- * which is the simplest standardized interface that accomplishes this.
- * We could add an implementation using libaio in the future; but note that
- * this API is inappropriate for libaio, which wants to have a buffer provided
- * to read into.
+ * Returns 0 on success, otherwise an errno error code (like posix_fadvise()).
+ *
+ * posix_fadvise() is the simplest standardized interface that accomplishes
+ * this. We could add an implementation using libaio in the future; but note
+ * that this API is inappropriate for libaio, which wants to have a buffer
+ * provided to read into.
*/
int
FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
{
-#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
- int returnCode;
-
Assert(FileIsValid(file));
DO_DB(elog(LOG, "FilePrefetch: %d (%s) " INT64_FORMAT " " INT64_FORMAT,
file, VfdCache[file].fileName,
(int64) offset, (int64) amount));
- returnCode = FileAccess(file);
- if (returnCode < 0)
- return returnCode;
+#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
+ {
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
retry:
- pgstat_report_wait_start(wait_event_info);
- returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
- POSIX_FADV_WILLNEED);
- pgstat_report_wait_end();
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
+ POSIX_FADV_WILLNEED);
+ pgstat_report_wait_end();
- if (returnCode == EINTR)
- goto retry;
+ if (returnCode == EINTR)
+ goto retry;
- return returnCode;
+ return returnCode;
+ }
+#elif defined(__darwin__)
+ {
+ struct radvisory
+ {
+ off_t ra_offset; /* offset into the file */
+ int ra_count; /* size of the read */
+ } ra;
+ int returnCode;
+
+ ra.ra_offset = offset;
+ ra.ra_count = amount;
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = fcntl(VfdCache[file].fd, F_RDADVISE, &ra);
+ pgstat_report_wait_end();
+ if (returnCode != -1)
+ return 0;
+ else
+ return errno;
+ }
#else
- Assert(FileIsValid(file));
return 0;
#endif
}
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e799c2989b8..e49eb13e43c 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -139,9 +139,8 @@
/*
* USE_PREFETCH code should be compiled only if we have a way to implement
* prefetching. (This is decoupled from USE_POSIX_FADVISE because there
- * might in future be support for alternative low-level prefetch APIs.
- * If you change this, you probably need to adjust the error message in
- * check_effective_io_concurrency.)
+ * might in future be support for alternative low-level prefetch APIs,
+ * as well as platform-specific APIs defined elsewhere.)
*/
#ifdef USE_POSIX_FADVISE
#define USE_PREFETCH
diff --git a/src/include/port/darwin.h b/src/include/port/darwin.h
index 15fb69d6dbb..6aa2ea70f6b 100644
--- a/src/include/port/darwin.h
+++ b/src/include/port/darwin.h
@@ -6,3 +6,8 @@
#define HAVE_FSYNC_WRITETHROUGH
#endif
+
+/*
+ * macOS has a platform-specific implementation of prefetching.
+ */
+#define USE_PREFETCH
base-commit: 94a3373ac5c3d2444b2379a3c185b986627c42d4
--
2.46.0
On Sat, Aug 24, 2024 at 12:28 AM Peter Eisentraut <peter@eisentraut.org> wrote:
In terms of $subject, this patch seems sufficient for now.
WFM. I noticed you don't have an EINTR retry loop, but the man page
doesn't say you need one, so overall this patch LGTM.
+ * posix_fadvise() is the simplest standardized interface that accomplishes
+ * this. We could add an implementation using libaio in the future; but note
+ * that this API is inappropriate for libaio, which wants to have a buffer
+ * provided to read into.
I would consider just dropping that comment about libaio, if touching
this paragraph. Proposals exist for AIO of course, but not with
libaio, and better predictions with useful discussion of the topic
seem unlikely to fit in this margin.
On 26.08.24 07:54, Thomas Munro wrote:
On Sat, Aug 24, 2024 at 12:28 AM Peter Eisentraut <peter@eisentraut.org> wrote:
In terms of $subject, this patch seems sufficient for now.
WFM. I noticed you don't have an EINTR retry loop, but the man page
doesn't say you need one, so overall this patch LGTM.+ * posix_fadvise() is the simplest standardized interface that accomplishes + * this. We could add an implementation using libaio in the future; but note + * that this API is inappropriate for libaio, which wants to have a buffer + * provided to read into.I would consider just dropping that comment about libaio, if touching
this paragraph. Proposals exist for AIO of course, but not with
libaio, and better predictions with useful discussion of the topic
seem unlikely to fit in this margin.
committed with that change
Oh, I missed something: I think we're missing FileAcces(), in case the
vfd has to be re-opened, no?
On 29.08.24 03:22, Thomas Munro wrote:
Oh, I missed something: I think we're missing FileAcces(), in case the
vfd has to be re-opened, no?
Fixed, thanks.
On Mon, Aug 19, 2024 at 1:35 AM Peter Eisentraut <peter@eisentraut.org> wrote:
On 17.08.24 00:01, Thomas Munro wrote:
I think that's fine. I don't really like the word "prefetch", could
mean many different things. What about "requires OS support for
issuing read-ahead advice", which uses a word that appears in both of
those interfaces?I like that term.
A couple of other places still use the old specific terminology. PSA.
Attachments:
0001-Standardize-read-ahead-advice-terminology.patchtext/x-patch; charset=US-ASCII; name=0001-Standardize-read-ahead-advice-terminology.patchDownload
From 287a56f8d177e56983d68cdd7201866ba119c130 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Sep 2024 13:30:22 +1200
Subject: [PATCH] Standardize "read-ahead advice" terminology.
Commit 6654bb920 added macOS's equivalent of POSIX_FADV_WILLNEED,
and adjusted a few explicit references to posix_fadvise to use a more
general name for the concept. Update some remaining references.
Reviewed-by:
Discussion: https://postgr.es/m/0827edec-1317-4917-a186-035eb1e3241d%40eisentraut.org
---
src/backend/access/transam/xlogprefetcher.c | 2 +-
src/backend/storage/aio/read_stream.c | 17 +++++++++--------
2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 3cb698d3bcb..3acaaea5b70 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -1083,7 +1083,7 @@ check_recovery_prefetch(int *new_value, void **extra, GucSource source)
#ifndef USE_PREFETCH
if (*new_value == RECOVERY_PREFETCH_ON)
{
- GUC_check_errdetail("\"recovery_prefetch\" is not supported on platforms that lack posix_fadvise().");
+ GUC_check_errdetail("\"recovery_prefetch\" is not supported on platforms that lack support for issuing read-ahead advice.");
return false;
}
#endif
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 93cdd35fea0..aae92e6b1f8 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -24,16 +24,17 @@
* already. There is no benefit to looking ahead more than one block, so
* distance is 1. This is the default initial assumption.
*
- * B) I/O is necessary, but fadvise is undesirable because the access is
- * sequential, or impossible because direct I/O is enabled or the system
- * doesn't support fadvise. There is no benefit in looking ahead more than
+ * B) I/O is necessary, but read-ahead advice is undesirable because the access
+ * is sequential and we can rely on the kernel's read-ahead heuristics, or
+ * impossible because direct I/O is enabled, or the system doesn't support
+ * read-ahead advice. There is no benefit in looking ahead more than
* io_combine_limit, because in this case the only goal is larger read system
- * calls. Looking further ahead would pin many buffers and perform
- * speculative work looking ahead for no benefit.
+ * calls. Looking further ahead would pin many buffers and perform speculative
+ * work for no benefit.
*
- * C) I/O is necessary, it appears random, and this system supports fadvise.
- * We'll look further ahead in order to reach the configured level of I/O
- * concurrency.
+ * C) I/O is necessary, it appears to be random, and this system supports
+ * read-ahead advice. We'll look further ahead in order to reach the
+ * configured level of I/O concurrency.
*
* The distance increases rapidly and decays slowly, so that it moves towards
* those levels as different I/O patterns are discovered. For example, a
--
2.46.0
On 03.09.24 03:47, Thomas Munro wrote:
On Mon, Aug 19, 2024 at 1:35 AM Peter Eisentraut <peter@eisentraut.org> wrote:
On 17.08.24 00:01, Thomas Munro wrote:
I think that's fine. I don't really like the word "prefetch", could
mean many different things. What about "requires OS support for
issuing read-ahead advice", which uses a word that appears in both of
those interfaces?I like that term.
A couple of other places still use the old specific terminology. PSA.
looks good to me