Add time spent in posix_fadvise() in I/O read time?

Started by Frédéric Yhuel, 10 months ago, 3 messages
#1 Frédéric Yhuel
frederic.yhuel@dalibo.com
1 attachment(s)

Hi,

I'm currently looking at how we can integrate effective_io_concurrency
into the planner cost model. This is actually a big overstatement, as I
have only done a bunch of tests so far. One thing I've noticed is that
posix_fadvise(,,POSIX_FADV_WILLNEED) isn't always non-blocking on Linux.
As Ted Ts'o explains in this old thread[1], it blocks when the request
queue fills up.
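For illustration, here is a small standalone C sketch (not PostgreSQL
code; the file path, scan size and 1 ms threshold are arbitrary
placeholders) that times repeated WILLNEED hints on a large uncached
file. If Ted Ts'o's explanation applies, the per-call latency should
jump once the device's request queue fills up:

/* time repeated posix_fadvise(POSIX_FADV_WILLNEED) calls on a large,
 * uncached file and report the calls that stall */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int     fd = open("/path/to/large_file", O_RDONLY);    /* placeholder */
    off_t   off;

    if (fd < 0)
        return 1;

    for (off = 0; off < (off_t) 1024 * 1024 * 1024; off += 8192)
    {
        struct timespec t0, t1;
        long    us;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        posix_fadvise(fd, off, 8192, POSIX_FADV_WILLNEED);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        us = (t1.tv_sec - t0.tv_sec) * 1000000L +
             (t1.tv_nsec - t0.tv_nsec) / 1000;
        /* a "non-blocking" hint that took more than 1 ms most likely
         * waited for room in the request queue */
        if (us > 1000)
            printf("fadvise at offset %lld blocked for %ld us\n",
                   (long long) off, us);
    }
    close(fd);
    return 0;
}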

This leads to very surprising EXPLAIN output. In the following test, run
on a machine with an ATA SSD (I can't observe the same on a machine with
an NVMe SSD), increasing effective_io_concurrency from 256 to 512 makes
the shared I/O read time drop from 1148 ms to 350 ms, while the
execution time remains almost the same:

CREATE UNLOGGED TABLE eic_uniform(a INT, filler TEXT);

INSERT INTO eic_uniform SELECT 10000 * random(), md5(i::text) FROM
generate_series(1,20000000) AS T(i);

VACUUM ANALYZE eic_uniform;

CREATE INDEX ON eic_uniform(a);

set work_mem = '16MB';
set max_parallel_workers_per_gather = 0;
set track_io_timing = on;

set effective_io_concurrency = 256;

-- remove heap from postgres cache and page cache

EXPLAIN (ANALYZE, SETTINGS, BUFFERS, COSTS OFF)
SELECT * FROM eic_uniform WHERE a < 140 AND filler > 'fffffff';

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on eic_uniform (actual time=2107.439..2107.440 rows=0 loops=1)
   Recheck Cond: (a < 140)
   Filter: (filler > 'fffffff'::text)
   Rows Removed by Filter: 279648
   Heap Blocks: exact=135778
   Buffers: shared read=136024
   I/O Timings: shared read=1147.903
   ->  Bitmap Index Scan on eic_uniform_a_idx (actual time=41.827..41.828 rows=279648 loops=1)
         Index Cond: (a < 140)
         Buffers: shared read=246
         I/O Timings: shared read=0.568
 Settings: effective_io_concurrency = '256', max_parallel_workers_per_gather = '0', work_mem = '16MB'
 Planning Time: 0.197 ms
 Execution Time: 2108.553 ms

set effective_io_concurrency = 512;

-- remove heap from postgres cache and page cache

EXPLAIN (ANALYZE, SETTINGS, BUFFERS, COSTS OFF)
SELECT * FROM eic_uniform WHERE a < 140 AND filler > 'fffffff';

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on eic_uniform (actual time=2118.706..2118.707 rows=0 loops=1)
   Recheck Cond: (a < 140)
   Filter: (filler > 'fffffff'::text)
   Rows Removed by Filter: 279648
   Heap Blocks: exact=135778
   Buffers: shared read=136024
   I/O Timings: shared read=350.014
   ->  Bitmap Index Scan on eic_uniform_a_idx (actual time=44.619..44.620 rows=279648 loops=1)
         Index Cond: (a < 140)
         Buffers: shared read=246
         I/O Timings: shared read=0.581
 Settings: effective_io_concurrency = '512', max_parallel_workers_per_gather = '0', work_mem = '16MB'
 Planning Time: 0.201 ms
 Execution Time: 2119.885 ms

Now, if we apply the attached patch (it applies to REL_17_STABLE and is
not meant to be reviewed), it becomes:

I/O Timings: shared wait=322.476 read=1147.903

and:

I/O Timings: shared wait=1030.122 read=350.014

where the "wait time" measures the time spent in posix_fadvise().

I don't know if we should do something like this. The breakdown is
useful for my study, but probably not for most users, although it might
be interesting to have statistics in pg_stat_io that would allow us to
detect a bad setting of effective_io_concurrency?

But if we don't, then I think that we should at least include the time
spent in posix_fadvise() in the I/O read time, don't you?

BTW, with the same dataset, e_i_c = 0 performs twice as well as
e_i_c = 1, which means that kernel readahead does a much better job than
PostgreSQL prefetching with default settings in this very specific case,
but I think this has already been discussed somewhere.

[1]: https://linux.kernel.narkive.com/WcGajTwh/posix-fadvise-posix-fadv-willneed-waits-before-returning#post2

Attachments:

0001-Add-a-new-IOOp-type-IOOP_WAITED.patch (text/x-patch)
From 87c2a3c20858162a210ea563a7576509ae090dcf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Yhuel?= <frederic.yhuel@dalibo.com>
Date: Thu, 13 Mar 2025 13:37:14 +0100
Subject: [PATCH] Add a new IOOp type, "IOOP_WAITED"

and measure the time spent in posix_fadvise(,,WILLNEED) inside the
FilePrefetch() function.

EXPLAIN's output displays the "shared wait" time if any, like in the
following example:

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Bitmap Heap Scan on eic_uniform (actual time=117.516..117.517 rows=0.00 loops=1)
   Recheck Cond: (a < 100)
   Filter: (filler > 'fffffff'::text)
   Rows Removed by Filter: 20175
   Heap Blocks: exact=11703
   Buffers: shared hit=19 read=11703
   I/O Timings: shared wait=54.162 read=26.283
[...]
---
 src/backend/commands/explain.c         | 6 ++++++
 src/backend/executor/instrument.c      | 3 +++
 src/backend/storage/file/fd.c          | 7 +++++++
 src/backend/utils/activity/pgstat_io.c | 5 +++++
 src/include/executor/instrument.h      | 1 +
 src/include/pgstat.h                   | 1 +
 6 files changed, 23 insertions(+)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8086607710e..8742712e9b4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3825,6 +3825,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
 			if (has_shared_timing)
 			{
 				appendStringInfoString(es->str, " shared");
+				if (!INSTR_TIME_IS_ZERO(usage->shared_blk_wait_time))
+					appendStringInfo(es->str, " wait=%0.3f",
+									 INSTR_TIME_GET_MILLISEC(usage->shared_blk_wait_time));
 				if (!INSTR_TIME_IS_ZERO(usage->shared_blk_read_time))
 					appendStringInfo(es->str, " read=%0.3f",
 									 INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
@@ -3886,6 +3889,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
 			ExplainPropertyFloat("Shared I/O Read Time", "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
 								 3, es);
+			ExplainPropertyFloat("Shared I/O Wait Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->shared_blk_wait_time),
+								 3, es);
 			ExplainPropertyFloat("Shared I/O Write Time", "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time),
 								 3, es);
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 268ae8a945f..d74ab140c0d 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -237,6 +237,7 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->temp_blks_written += add->temp_blks_written;
 	INSTR_TIME_ADD(dst->shared_blk_read_time, add->shared_blk_read_time);
 	INSTR_TIME_ADD(dst->shared_blk_write_time, add->shared_blk_write_time);
+	INSTR_TIME_ADD(dst->shared_blk_wait_time, add->shared_blk_wait_time);
 	INSTR_TIME_ADD(dst->local_blk_read_time, add->local_blk_read_time);
 	INSTR_TIME_ADD(dst->local_blk_write_time, add->local_blk_write_time);
 	INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -261,6 +262,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
 	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
 						  add->shared_blk_read_time, sub->shared_blk_read_time);
+	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_wait_time,
+						  add->shared_blk_wait_time, sub->shared_blk_wait_time);
 	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
 						  add->shared_blk_write_time, sub->shared_blk_write_time);
 	INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a7c05b0a6fd..0017e79e558 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -167,6 +167,8 @@ int			recovery_init_sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
 /* Which kinds of files should be opened with PG_O_DIRECT. */
 int			io_direct_flags;
 
+extern bool             track_io_timing;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
@@ -2079,6 +2081,9 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
 {
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
 	int			returnCode;
+	instr_time              io_start;
+	IOContext               io_context = IOCONTEXT_NORMAL;
+	IOObject                io_object = IOOBJECT_RELATION;
 
 	Assert(FileIsValid(file));
 
@@ -2092,8 +2097,10 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
 
 retry:
 	pgstat_report_wait_start(wait_event_info);
+	io_start = pgstat_prepare_io_time(track_io_timing);
 	returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
 							   POSIX_FADV_WILLNEED);
+	pgstat_count_io_op_time(io_object, io_context, IOOP_WAITED, io_start, amount/BLCKSZ);
 	pgstat_report_wait_end();
 
 	if (returnCode == EINTR)
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 9d6e0673827..6223f46c779 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -145,6 +145,11 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			else if (io_object == IOOBJECT_TEMP_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
 		}
+		else if (io_op == IOOP_WAITED)
+		{
+			if (io_object == IOOBJECT_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.shared_blk_wait_time, io_time);
+		}
 
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index bfd7b6d8445..193c807d98c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -34,6 +34,7 @@ typedef struct BufferUsage
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
 	instr_time	shared_blk_read_time;	/* time spent reading shared blocks */
+	instr_time	shared_blk_wait_time;	/* time spent waiting for shared blocks */
 	instr_time	shared_blk_write_time;	/* time spent writing shared blocks */
 	instr_time	local_blk_read_time;	/* time spent reading local blocks */
 	instr_time	local_blk_write_time;	/* time spent writing local blocks */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2136239710e..2dee07f11e4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -300,6 +300,7 @@ typedef enum IOOp
 	IOOP_FSYNC,
 	IOOP_HIT,
 	IOOP_READ,
+	IOOP_WAITED,
 	IOOP_REUSE,
 	IOOP_WRITE,
 	IOOP_WRITEBACK,
-- 
2.47.2

#2 Frédéric Yhuel
frederic.yhuel@dalibo.com
In reply to: Frédéric Yhuel (#1)
Re: Add time spent in posix_fadvise() in I/O read time?

On 3/14/25 09:43, Frédéric Yhuel wrote:

> One thing I've noticed is that posix_fadvise(,,POSIX_FADV_WILLNEED)
> isn't always non-blocking on Linux. As Ted Ts'o explains in this old
> thread[1], it blocks when the request queue fills up.

When posix_fadvise() blocks, it doesn't seem to be completely off-CPU
(but mostly, at least on my machine), and I assume that this is the
reason for the sentence "A value higher than needed to keep the disks
busy will only result in extra CPU overhead" in the documentation for
effective_io_concurrency? My guess is that Linux uses a spinlock
somewhere, and that this explains the CPU overhead.

Also, on Linux, it seems we can control the size of the request queue
with the /sys/block/XXX/queue/nr_requests setting. On my machine, it is
set to 64 by default. When I set it to 128, postgres spends less time
on posix_fadvise(), and correspondingly more on pread().

#3 Cédric Villemain
Cedric.Villemain@Data-Bene.io
In reply to: Frédéric Yhuel (#2)
Re: Add time spent in posix_fadvise() in I/O read time?

On 19/03/2025 14:25, Frédéric Yhuel wrote:

> On 3/14/25 09:43, Frédéric Yhuel wrote:
>
>> One thing I've noticed is that posix_fadvise(,,POSIX_FADV_WILLNEED)
>> isn't always non-blocking on Linux. As Ted Ts'o explains in this old
>> thread[1], it blocks when the request queue fills up.
>
> When posix_fadvise() blocks, it doesn't seem to be completely off-CPU
> (but mostly, at least on my machine), and I assume that this is the
> reason for the sentence "A value higher than needed to keep the disks
> busy will only result in extra CPU overhead" in the documentation for
> effective_io_concurrency? My guess is that Linux uses a spinlock
> somewhere, and that this explains the CPU overhead.
>
> Also, on Linux, it seems we can control the size of the request queue
> with the /sys/block/XXX/queue/nr_requests setting. On my machine, it
> is set to 64 by default. When I set it to 128, postgres spends less
> time on posix_fadvise(), and correspondingly more on pread().

If you're still on this topic, some additions:

* io_uring_prep_fadvise() (a liburing helper, supposed to be an "async
posix_fadvise") is maybe interesting to look at. I don't know what has
been done here within Andres' work, nor whether it is applicable (I
didn't look at this code yet), but "git grep" didn't turn up anything.
A minimal sketch of how liburing exposes it follows after this list.

* nr_requests: I have 1023 (NVMe) with Linux 6.12.22-amd64 #1 SMP
PREEMPT_DYNAMIC Debian 6.12.22-1 (2025-04-10) x86_64.

Note that the I/O scheduler also changes the behavior of the queue: with
mq-deadline, 25% of the queue is reserved for sync requests. See
https://github.com/torvalds/linux/commit/07757588e5076748308dd95ee2e3cd0b82ebb8c4

* https://wiki.postgresql.org/wiki/Readahead - if you want to
read/check/enhance its content.
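
For reference, here is a minimal standalone sketch of how liburing
exposes io_uring_prep_fadvise(). Assumptions: the file path is a
placeholder, the kernel supports IORING_OP_FADVISE (5.6+), and this
says nothing about how it would actually be wired into FilePrefetch():

/* queue a WILLNEED hint through io_uring instead of calling
 * posix_fadvise() synchronously */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring      ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int     fd = open("/path/to/relation_segment", O_RDONLY);  /* placeholder */

    if (fd < 0 || io_uring_queue_init(64, &ring, 0) < 0)
        return 1;

    /* prepare and submit the advice; the fadvise work is then carried
     * out asynchronously on the kernel side rather than blocking the
     * caller the way a plain posix_fadvise() call can */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fadvise(sqe, fd, 0, 8192, POSIX_FADV_WILLNEED);
    io_uring_submit(&ring);

    /* ... do useful work here, reap the completion later ... */
    if (io_uring_wait_cqe(&ring, &cqe) == 0)
    {
        printf("fadvise completed with res=%d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}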

I hope it helps.

---
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D