pg_stat_io_histogram

andres@anarazel.de

3 months ago

In reply to: Jakub Wartak (#1)

Re: pg_stat_io_histogram

Hi,

On 2026-01-26 10:40:52 +0100, Jakub Wartak wrote:

I'm proposing that we add pg_stat_io_histogram that would track/show I/O
latencies profile, so we could quickly identify I/O outliers.

I think that's an interesting feature!

From time to time users complain that 'PostgreSQL is slow or stuck' (usually
COMMIT is slow), when it is quite apparent that it is down to somewhere in
the I/O stack. It is quite easy to prove once one has proper measurement
tools in place and is able to correlate, but it takes IMHO way too much time
and energy to cross-correlate all of that information (iostat -x 1s, wait
events 1s, and so on), especially if one would like to provide rapid
response.

For measuring particularly stuck things, I've been wondering about having a
regular timer that starts to collect more information if stuck in a place for
a while. That would probably end up being lower overhead than constantly
measuring... But would also be a lot more work.

Right now the patch does not include per-backend/PID tracking, hopefully if
there will be interest in this, I'll add it, but I would like to first hear
if that's a good idea. The current implementation uses fast bucket
calculation to avoid overheads and tries to cover most useful range of
devices via buckets (128us..256ms, so that covers both NVMe/SSD/HDD and
abnormally high latency too as from time to time I'm try to help with I/O
stuck for *seconds*, usually a sign of some I/O multipath issues, device
resetting, or hypervisor woes).

Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
NVMe disk you'll get < 10us, after all.

I see a few problems with the source of the latency measurements though:

- The latency gathered is that it's quite heavily affected by scheduler
noise. If your process isn't scheduled because other processes are busy
doing stuff on the CPU, it's quite possible to get results many orders of
magnitude wrong.

- With AIO, you're measuring wait time, and that time can be affected by other
IOs in the queue. That will often *drastically* overestimate IO latency
measured this way.

I don't see how we can do better absent cooperation from the kernel (by
putting lower-level measurements into io_uring completions, for example)
though. So maybe this is just how it has to be and we ought to just document
it.

postgres=# select
substring(backend_type,1,8) as backend,object,context,io_type,
bucket_latency_us as lat_us,
round(bucket_latency_us/1000.0, 3) as lat_ms,
bucket_count as count
from pg_stat_get_io_histogram()
where
bucket_count > 0
order by 1,2,3,4,5;
backend | object | context | io_type | lat_us | lat_ms | count
----------+----------+-----------+-----------+--------+--------+-------
autovacu | relation | normal | read | 128 | 0.128 | 54

Perhaps the latency should be represented as a range?

Of course most of the I/O calls today are hitting page cache, so one would
expect they'll be < 128us most of the time

Have you measured whether overhead is measurable when hitting the page cache?
I'd hope that it doesn't, due to io combing amortizing the cost somewhat. But
it seems worth measuring.

I assume you made pgstat_get_io_op_name() return "hit?" because you don't
expect that to ever be hit?

+static inline int get_bucket_index(uint32_t val) {
+#define MIN_PG_STAT_IO_HIST_LATENCY 127
+	const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1;
+	/*
+	 * hopefully calculated to be 25 by the compiler:
+	 * clz(127) = clz(01111111b on uint32) = 25
+	 */
+	const uint32_t min_latency_leading_zeros =
+		pg_leading_zero_bits32(MIN_PG_STAT_IO_HIST_LATENCY);
+
+	/*
+	 * make sure the tmp value at least 127 (our minimum bucket size)
+	 * as __builtin_clz might return undefined behavior when operating on 0
+	 */
+	uint32_t tmp = val | MIN_PG_STAT_IO_HIST_LATENCY;

+	/* count leading zeros */
+	int leading_zeros = pg_leading_zero_bits32(tmp);
+
+	/* normalize the index */
+	uint32_t index = min_latency_leading_zeros - leading_zeros;
+
+	/* clamp it to the maximum */
+	return (index > max_index) ? max_index : index;
+}

Wouldn't it be easier to handle the minimum latency by shifting right?

I think we may also need to handle inputs that don't fit a uint32. For things
like a stopped VM or such we could see IOs that that don't fit into a uint32
when measured in microseconds. So perhaps I'd make the input to the bucket
calc 64 bits, then shift to the minimum precision and mask to implement
clamping.

@@ -152,6 +189,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
io_time);
+		/* calculate the bucket_index based on latency in us */
+		bucket_index = get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time));
+		PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+
/* Add the per-backend count */
pgstat_count_backend_io_op_time(io_object, io_context, io_op,

It's annoying to have to convert to microseconds here, that's not free :(.

@@ -1356,6 +1356,24 @@ typedef enum io_stat_col
IO_NUM_COLUMNS,
} io_stat_col;

+/*
+* When adding a new column to the pg_stat_io view and the
+* pg_stat_get_backend_io() function, add a new enum value here above
+* IO_NUM_COLUMNS.
+*/
+typedef enum hist_io_stat_col
+{
+	HIST_IO_COL_INVALID = -1,
+	HIST_IO_COL_BACKEND_TYPE,
+	HIST_IO_COL_OBJECT,
+	HIST_IO_COL_CONTEXT,
+	HIST_IO_COL_IOTYPE,
+	HIST_IO_COL_BUCKET_US,
+	HIST_IO_COL_COUNT,
+	HIST_IO_COL_RESET_TIME,
+	HIST_IO_NUM_COLUMNS
+} history_get_history_state;

Think the IO_NUM_COLUMNS reference in the comment is a copy-pasto. I don't
think this should be introduced in the middle of the pg_stat_io implementation.

+/*
+ * pg_leading_zero_bits32
+ *		Returns the number of leading 0-bits in x, starting at the most significant bit position.
+ *		Word must not be 0 (as it is undefined behavior).
+ */
+static inline int
+pg_leading_zero_bits32(uint32 word)

Do we really need this in addition to the already existing
pg_leftmost_one_pos32()? Particularly because that already has an msvc
implementation...

Greetings,

Andres Freund

[1]: /messages/by-id/CAKZiRmyZzmOODYS6n8mns9zN4RcS3o9kfrdQDyeRupqaGp9PmQ@mail.gmail.com
/messages/by-id/CAKZiRmyZzmOODYS6n8mns9zN4RcS3o9kfrdQDyeRupqaGp9PmQ@mail.gmail.com

jakub.wartak@enterprisedb.com

3 months ago

In reply to: Andres Freund (#2)

Re: pg_stat_io_histogram

On Mon, Jan 26, 2026 at 4:08 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2026-01-26 10:40:52 +0100, Jakub Wartak wrote:

I'm proposing that we add pg_stat_io_histogram that would track/show I/O
latencies profile, so we could quickly identify I/O outliers.

I think that's an interesting feature!

Hi Andres, thanks for kind words and review!

[..]

For measuring particularly stuck things, I've been wondering about having

regular timer that starts to collect more information if stuck in a place

for

a while. That would probably end up being lower overhead than constantly
measuring... But it would also be a lot more work.

Well if something is really stuck, I think the wait events are covering us
on that,
aren't they? One can argue if they carry enough information (for me they
mostly
do, but I'm trying to squeeze some more stuff into them in a nearby thread
[1]: /messages/by-id/CAKZiRmyZzmOODYS6n8mns9zN4RcS3o9kfrdQDyeRupqaGp9PmQ@mail.gmail.com
BTW: it's kind of "blocked" due to that 56-bit relfilenode idea/question,
any thoughts on that?)

Right now the patch does not include per-backend/PID tracking,

hopefully if

there will be interest in this, I'll add it, but I would like to first

hear

if that's a good idea. The current implementation uses fast bucket
calculation to avoid overheads and tries to cover most useful range of
devices via buckets (128us..256ms, so that covers both NVMe/SSD/HDD and
abnormally high latency too as from time to time I'm try to help with

I/O

stuck for *seconds*, usually a sign of some I/O multipath issues, device
resetting, or hypervisor woes).

Hm. Isn't 128us a pretty high floor for at least reads and writes? On a

good

NVMe disk you'll get < 10us, after all.

I was blind and concentrated way too much on the bad-behaving I/O rather
than good
I/O - let's call it I/O negativity bias 8)

Now v2 contains the min bucket lowered to 8us (but max then is just ~131ms,
I
didn't want it to use more than 64b total, 16*4b (uint32)=64b and well
16*8b(uint64)=128b already, so that's why it's capped max at 131072us right
now).

I see a few problems with the source of the latency measurements though:

- The latency gathered is that it's quite heavily affected by scheduler
noise. If your process isn't scheduled because other processes are busy
doing stuff on the CPU, it's quite possible to get results many orders

magnitude wrong.

- With AIO, you're measuring wait time, and that time can be affected by

other

IOs in the queue. That will often *drastically* overestimate IO latency
measured this way.

I don't see how we can do better absent cooperation from the kernel (by
putting lower-level measurements into io_uring completions, for example)
though. So maybe this is just how it has to be and we ought to just

document

it.

Right, I think this is a complex topic on it's own, I've added a small
section into
the docs. I didn't want to start the thread with undermining my own
results, but
indeed I'm getting "bad" numbers. Bad in essence that perceived latency
numbers do not match with other stuff:

E.g.checkpointer's fsync/fdatasync latency is awful, although i've been
using
this to simulate latency (just 2ms!, but it ended up adding way more):
dd if=/dev/zero of=/fs bs=1M count=1024
losetup /dev/loop15 /fs
echo "0 $(blockdev --getsz /dev/loop15) delay /dev/loop15 0 2" | \
dmsetup create delay
mkfs.ext4 /dev/mapper/delay
mount /dev/mapper/delay /mnt

... and it has e.g. quite interesting effects:

- lack of "noatime" the impact is clearly visible on fsync

- even with noatime I'm was getting spikes of latenices above 131ms
(sic!) with this:
select pg_stat_reset_shared();
pgbench -i -s 10 -p 1234 -h /tmp postgres
checkpoint;

- I've created attached bpftrace to see the gap between kernel and uprobes,
but
it's not that high, sample of the view

vs eBPF, which does not seem to see that, worst case seem to be like
wasted
~20us (gap is between user and kernel)

- in mdsyncfiletag() we seem to have pgstat_count_io_op_time() *after*
potential
FileClose(), but I haven't witnessed long close(), it's still
context-switch

- I've spotted that power mgmt might be influencing it even further (but
that's not
in the docs yet, dunno if I should add it there like the next item on the
list)

backend | object | context | io_type | lat_us | lat_ms | count
----------+----------+-----------+-----------+--------+--------+-------
autovacu | relation | normal | read | 128 | 0.128 | 54

Perhaps the latency should be represented as a range?

Cool idea, I haven't even thought about this one! From now v2 shows:

Of course most of the I/O calls today are hitting page cache, so one

would

expect they'll be < 128us most of the time

Have you measured whether overhead is measurable when hitting the page

cache?

I'd hope that it doesn't, due to io combing amortizing the cost somewhat.

But

it seems worth measuring.

Not yet, I first wanted to hear if I'm not sailing into some plain stupid
direction somewhere with this idea or implementation (e.g.
that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).

I'll try to perform this test overhead measurement hopefully with v3 once
we settle on how to do that bit shifting/clz().

I assume you made pgstat_get_io_op_name() return "hit?" because you don't
expect that to ever be hit?

Yes, my patch seems to always return 0 for "hit?". I'll need to investigate
that
further.

+static inline int get_bucket_index(uint32_t val) {
+#define MIN_PG_STAT_IO_HIST_LATENCY 127
+     const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1;
+     /*
+      * hopefully calculated to be 25 by the compiler:
+      * clz(127) = clz(01111111b on uint32) = 25
+      */
+     const uint32_t min_latency_leading_zeros =
+             pg_leading_zero_bits32(MIN_PG_STAT_IO_HIST_LATENCY);
+
+     /*
+      * make sure the tmp value at least 127 (our minimum bucket size)
+      * as __builtin_clz might return undefined behavior when

operating on 0

+      */
+     uint32_t tmp = val | MIN_PG_STAT_IO_HIST_LATENCY;

+     /* count leading zeros */
+     int leading_zeros = pg_leading_zero_bits32(tmp);
+
+     /* normalize the index */
+     uint32_t index = min_latency_leading_zeros - leading_zeros;
+
+     /* clamp it to the maximum */
+     return (index > max_index) ? max_index : index;
+}

Wouldn't it be easier to handle the minimum latency by shifting right?

What you seem to suggest seems to be equally width buckets {1,2,3,4..ms} and
not logarithmic buckets {1,2,4,8..ms} or am I missing something? The patch
as is
stands has two ways #ifdef implementations now, with bitwise shifting
working
now, but even in objdump on -O2 there's plenty of those jumps because of
current
code.

I think we may also need to handle inputs that don't fit a uint32.

Fixed.

[..] For things
like a stopped VM or such we could see IOs that that don't fit into a

uint32

when measured in microseconds. So perhaps I'd make the input to the

bucket

calc 64 bits, then shift to the minimum precision and mask to implement
clamping.

@@ -152,6 +189,10 @@ pgstat_count_io_op_time(IOObject io_object,

IOContext io_context, IOOp io_op,

INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],

io_time);

+             /* calculate the bucket_index based on latency in us */
+             bucket_index =

get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time));

+

PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;

+
/* Add the per-backend count */
pgstat_count_backend_io_op_time(io_object, io_context,

io_op,

It's annoying to have to convert to microseconds here, that's not free :(.

Oooops, fixed, get_bucket_index() now operates directly on nanos/int64.

[..]

+     HIST_IO_COL_COUNT,
+     HIST_IO_COL_RESET_TIME,
+     HIST_IO_NUM_COLUMNS
+} history_get_history_state;

]> Think the IO_NUM_COLUMNS reference in the comment is a copy-pasto. I
don't

think this should be introduced in the middle of the pg_stat_io

implementation.

Right, I've moved it to just before pg_stat_io_histogram_build_tuples().

+/*
+ * pg_leading_zero_bits32
+ *           Returns the number of leading 0-bits in x, starting at

the most significant bit position.

+ *           Word must not be 0 (as it is undefined behavior).
+ */
+static inline int
+pg_leading_zero_bits32(uint32 word)
Do we really need this in addition to the already existing
pg_leftmost_one_pos32()? Particularly because that already has an msvc
implementation...

Well, I would be all in for removal , but please see above the
get_bucket_index() discussion.
I've tried to get rid of it (but maybe i misunderstood something), but in
the end I think it is more
elegant/faster to have it there so that code in get_bucket_index() stays
more readable, rather
than throw more bitwise voodoo there(?)

-J.

jakub.wartak@enterprisedb.com

3 months ago

In reply to: Jakub Wartak (#3)

Re: pg_stat_io_histogram

On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>
wrote:

[..]

Of course most of the I/O calls today are hitting page cache, so one

would

expect they'll be < 128us most of the time

Have you measured whether overhead is measurable when hitting the page

cache?

I'd hope that it doesn't, due to io combing amortizing the cost

somewhat. But

it seems worth measuring.

Not yet, I first wanted to hear if I'm not sailing into some plain stupid
direction somewhere with this idea or implementation (e.g.
that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).

I'll try to perform this test overhead measurement hopefully with v3 once
we settle on how to do that bit shifting/clz().

[..]
Here's the answer: on properly isolated perf test run (my
old&legacy&predictiable
4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages, no turboboost, proper
warmup,
no THP, cpupower D0, no physical I/O, ~22k pread64() calls/sec combined to
VFS
cache)
and started on just using single NUMA: numactl --membind=0
--cpunodebind=0
measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S

master+track_io_timings=on, 60s warmup and then 3x runs
tps = 44615.603668
tps = 44556.191492
tps = 44813.793981
avg = 44662

master+track_io_timings=on+patch, , 60s warmup and then 3x runs
tps = 44441.879384
tps = 44403.101737
tps = 45036.747418
avg = 44627

so that's like 99.921% (so literally no overhead) and yields picture like:

postgres=# select bucket_latency_us, bucket_count
from pg_stat_io_histogram
where
bucket_count > 0 and
backend_type = 'client backend' and
io_type = 'read' and
context = 'normal'
order by bucket_latency_us;
bucket_latency_us | bucket_count
-------------------+--------------
[0,9) | 273455
[8,17) | 3820379
[16,33) | 29359
[32,65) | 585
[64,129) | 467
[128,257) | 419
[256,513) | 15828
[512,1025) | 89

So one can also see 0.25..0.5ms bucket being larger there (initial reading
of the data from physical device) and that's hardware RAID-1 with 2x
Intel D3-S4510. And if I do pg_buffercache_evict_all()+vm drop_cache+
pg_stat_reset_shared(), I get this picture (note for bulkreads):

postgres=# select bucket_latency_us, bucket_count
from pg_stat_io_histogram
where
bucket_count > 0 and
backend_type = 'client backend' and context='bulkread';
bucket_latency_us | bucket_count
-------------------+--------------
[0,9) | 102555
[8,17) | 70
[16,33) | 3938
[32,65) | 6763
[64,129) | 5206
[128,257) | 9392
[256,513) | 10959
[512,1025) | 21372
[1024,2049) | 502
[2048,4097) | 34
[4096,8193) | 2
[8192,16385) | 2
[16384,32769) | 7

So clearly there's a distinction between reading the VFS cache and hitting
physical I/O.

Now we'll just probably settle on the proper get_bucket_index() impl.

-J.

andres@anarazel.de

3 months ago

In reply to: Jakub Wartak (#4)

Re: pg_stat_io_histogram

Hi,

On 2026-01-28 12:12:10 +0100, Jakub Wartak wrote:

On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>

Not yet, I first wanted to hear if I'm not sailing into some plain stupid
direction somewhere with this idea or implementation (e.g.
that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).

I'll try to perform this test overhead measurement hopefully with v3 once
we settle on how to do that bit shifting/clz().

[..]
Here's the answer: on properly isolated perf test run (my
old&legacy&predictiable
4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages, no turboboost, proper
warmup,
no THP, cpupower D0, no physical I/O, ~22k pread64() calls/sec combined to
VFS
cache)
and started on just using single NUMA: numactl --membind=0
--cpunodebind=0
measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S

master+track_io_timings=on, 60s warmup and then 3x runs
tps = 44615.603668
tps = 44556.191492
tps = 44813.793981
avg = 44662

master+track_io_timings=on+patch, , 60s warmup and then 3x runs
tps = 44441.879384
tps = 44403.101737
tps = 45036.747418
avg = 44627

so that's like 99.921% (so literally no overhead) and yields picture like:

I don't think that's a particularly useful assurance, unfortunately:

1) Using pgbench with an in-memory readonly workload is typically limited by
context switch overhead and per-statement overhead. After a short while you
have at most one IO per statement (the heap page), which obviously isn't
going to be affected by a small per-IO overhead.

2) The per-core memory bandwidth on that old machine, if it's the quite old
EDB machine I think it is, is so low, that you'd be bottlenecked by memory
bandwidth well before you're going to be bottlenecked by actual CPU stuff
(which the bucket computation is).

I think you'd have to test something like pg_prewarm(), with
io_combine_limit=1, on a modern *client* CPU (client CPUs typically have much
higher per-core memory bandwidth than the more scalable server CPUs).

Greetings,

Andres Freund

[1]: /messages/by-id/20231115180433.p3eeaczbam5zxdz5@awork3.anarazel.de
[2]: https://vondra.me/posts/benchmarking-is-hard-sometimes/

jakub.wartak@enterprisedb.com

3 months ago

In reply to: Andres Freund (#5)

Re: pg_stat_io_histogram

On Thu, Jan 29, 2026 at 5:27 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2026-01-28 12:12:10 +0100, Jakub Wartak wrote:

On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com>

Not yet, I first wanted to hear if I'm not sailing into some plain stupid
direction somewhere with this idea or implementation (e.g.
that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).

I'll try to perform this test overhead measurement hopefully with v3 once
we settle on how to do that bit shifting/clz().

[..]
Here's the answer: on properly isolated perf test run (my
old&legacy&predictiable
4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages, no turboboost, proper
warmup,
no THP, cpupower D0, no physical I/O, ~22k pread64() calls/sec combined to
VFS
cache)
and started on just using single NUMA: numactl --membind=0
--cpunodebind=0
measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S

master+track_io_timings=on, 60s warmup and then 3x runs
tps = 44615.603668
tps = 44556.191492
tps = 44813.793981
avg = 44662

master+track_io_timings=on+patch, , 60s warmup and then 3x runs
tps = 44441.879384
tps = 44403.101737
tps = 45036.747418
avg = 44627

so that's like 99.921% (so literally no overhead) and yields picture like:

I don't think that's a particularly useful assurance, unfortunately:

1) Using pgbench with an in-memory readonly workload is typically limited by
context switch overhead and per-statement overhead. After a short while you
have at most one IO per statement (the heap page), which obviously isn't
going to be affected by a small per-IO overhead.

2) The per-core memory bandwidth on that old machine, if it's the quite old
EDB machine I think it is, is so low, that you'd be bottlenecked by memory
bandwidth well before you're going to be bottlenecked by actual CPU stuff
(which the bucket computation is).

Hi, thanks for having a look!

That legacy server is mine, but yeah even on same NUMA it can just get
~16GB/s AFAIR and just ~4GB between nodes. I've forgot to reply in
that old NUMA thread
back then, maybe it's not relevant, but I find it valuable often as
the bottlenecks
are easier to hit/notice and there's not that many traps that modern CPUs have
(+ find having 4 NUMAs/socket in 2U is not that easy today, after all
it's physical
box nearby so PMCs are there too - unlike in "cloud", and by having 4 nodes the
disbalances between nodes/zones are much more cleanly visible than
just 1 local vs
1 remote). Somehow I built trust on the results that machine (but I
still can lift it!
so probably still shouldn't trust it fully - pun to the "never trust a
machine you
can lift" :D)

I think you'd have to test something like pg_prewarm(), with
io_combine_limit=1, on a modern *client* CPU (client CPUs typically have much
higher per-core memory bandwidth than the more scalable server CPUs).

Fair point, thanks for explaining in the above chapter. So for much more modern
Intel(R) Ultra 7 (1s16c32t,that can - according to mlc(1) - can do up
to 85-90GB/s
bandwidth and has those ugly P/E-cores, so I've pinned (taskset) the backend
doing pg_prewam to P-Core @ that usually runs @ 5Ghz, but with no_turbo
that's just @ 1.4Ghz). That was on normal build (not debug, so -O2, no casserts,
gcc 13, kernel 6.14.x, no_turbo, s_b=8GB with HP, DB scale 600
(pgbench_accounts @
7.7GB), performance governor, no THP, ...)

I'was kind of surprised, but here it goes full disclosure of my results. With
the patch and track_io_timing=on, io_combine_limit=1, IO_METHOD='SYNC'
just to care
about 1 PID, here are timings of pg_buffercache_evict_all() and then
measured duration of
select pg_prewarm('pgbench_accounts');

So initially I've got this picture, eliminated some worst/best too:

Without patch:
Time: 4644.346 ms (00:04.644)
Time: 4612.610 ms (00:04.613)
Time: 4639.133 ms (00:04.639)
Time: 4625.020 ms (00:04.625)
Time: 4636.652 ms (00:04.637)
Avg: 4631ms

With the patch:
Time: 4765.780 ms (00:04.766)
Time: 4784.308 ms (00:04.784)
Time: 4754.661 ms (00:04.755)
Time: 4770.772 ms (00:04.771)
Time: 4768.232 ms (00:04.768)
Avg: 4768ms (102.95%)

With the patch and __builtin_clzl()
Time: 4750.293 ms (00:04.750)
Time: 4729.288 ms (00:04.729)
Time: 4727.820 ms (00:04.728)
Time: 4729.760 ms (00:04.730)
Time: 4727.146 ms (00:04.727)
Avg: 4732ms (102.18%)

So clearly there was some overhead (I've started getting worried),
and __builtin_clz() was cheaper slightly cheaper or just too much jitter --
yes I've got plenty jittering out with this (not-shown, so above are like 5
results out of 15).

With v2 patch and __builtin_clzl() and default track_io_timing=off (default)
got back to ~4660m as expected.

With v2 patch and __builtin_clzl() and default io_combine_limit=128kB
and track_io_timing=off, went back to ~4150ms:
Time: 4151.942 ms (00:04.152)
Time: 4133.747 ms (00:04.134)
Time: 4153.103 ms (00:04.153)
Time: 4135.199 ms (00:04.135)

With thje patch and __builtin_clzl() and default io_combine_limit=128kB
track_io_timing=on, was also @ ~4150ms.
Time: 4152.941 ms (00:04.153)
Time: 4154.096 ms (00:04.154)
Time: 4155.119 ms (00:04.155)

So with "batching" the IOs, the overhead is almost gone. BTW that's with
current_clocksource says "tsc". After dozens of runs, I've noticed thermals
starting playing a bigger role than this patch, so i've did idle-set -D0
and it degraded even further.

Master, but still got lots of fluctuations, non filtered picture
Time: 5518.546 ms (00:05.519)
Time: 5587.675 ms (00:05.588)
Time: 5512.828 ms (00:05.513)
Time: 5534.023 ms (00:05.534)
Time: 5728.125 ms (00:05.728)
Time: 5731.543 ms (00:05.732)
Time: 5762.687 ms (00:05.763)
Time: 5565.607 ms (00:05.566)
Time: 5498.496 ms (00:05.498)
Time: 5637.870 ms (00:05.638)

but if I leave it *idle* for a while couple minutes then I get:
Time: 5577.879 ms (00:05.578)
Time: 5575.648 ms (00:05.576)
Time: 5548.146 ms (00:05.548)

Some break and with the patch and __builtin_clzl (it gets lower sometimes than
master, how can I trust this?!)
Time: 5504.415 ms (00:05.504)
Time: 5531.827 ms (00:05.532)
Time: 5733.146 ms (00:05.733)
Time: 5511.549 ms (00:05.512)

So something more happening there , probably with thermals/scheduler than with
patch. So of course I've done some home work [1]/messages/by-id/20231115180433.p3eeaczbam5zxdz5@awork3.anarazel.de[2]https://vondra.me/posts/benchmarking-is-hard-sometimes/, I have found even
Your's rant on
some of this [1]/messages/by-id/20231115180433.p3eeaczbam5zxdz5@awork3.anarazel.de and truth to be told I'm unable to stabilize those
deviations on
this __modern__ client CPU. So i've tried on another much more predictable
(and non-modern :P) client CPU: Intel Core i5 7600k (1s4c4t) and got much more
consistent numbers there (those are non-filtered, almost identical variables
from also same setup(also 6.14.x, same tweaks, also with taskset to 1c) ):

Master:
Time: 2592.351 ms (00:02.592)
Time: 2574.612 ms (00:02.575)
Time: 2592.530 ms (00:02.593)
Time: 2575.356 ms (00:02.575)
Time: 2594.687 ms (00:02.595)
Avg=2585ms

Master+patch:
Time: 2577.610 ms (00:02.578)
Time: 2585.796 ms (00:02.586)
Time: 2568.559 ms (00:02.569)
Time: 2586.199 ms (00:02.586)
Time: 2567.872 ms (00:02.568)
Avg=2576ms (below master?!)

Master+patch__builtin_clzl:
Time: 2578.083 ms (00:02.578)
Time: 2586.732 ms (00:02.587)
Time: 2573.176 ms (00:02.573)
Time: 2592.048 ms (00:02.592)
Time: 2575.731 ms (00:02.576)
Time: 2575.570 ms (00:02.576)
Avg=2579ms (below master?!)

Just Master again:
Time: 2578.838 ms (00:02.579)
Time: 2588.531 ms (00:02.589)
Time: 2572.165 ms (00:02.572)
Time: 2591.528 ms (00:02.592)
Time: 2572.015 ms (00:02.572)
Time: 2589.921 ms (00:02.590)
Time: 2572.124 ms (00:02.572)
Avg=2580ms

So to sum-up:
- it still looks OK to me
- bigger impact than the patches itself can be thermals on modern-day CPUs(?)
- older/legacy CPU (desktop one) seems to be less jittering than modern client
laptop CPU even with the most strict perf. settings (?)
- worst-case: to spot that ~2% regression one would have to disable
the io batching,
enable track_io_timing (that's the not default)

I'm attaching v3 which has now default switched to __builtin_clzl() which
works ok for uint64 (not sure if I need to care about __builtin_clzll
on Windows?).

Open questions:
0. Should I pursue more benchmarking or the above results are enough?
1. __builtin_clzl() or not to __builtin_clzl() that is the question... ?
2. Should I add per-PID backend stats too or are you having something
against it?
3. Shouldn't we fix that mdsyncfiletag() mentioned earlier we seem to have
pgstat_count_io_op_time() *after* potential FileClose() (as per my
earlier question)

-J.

jakub.wartak@enterprisedb.com

3 months ago

In reply to: Jakub Wartak (#6)

Re: pg_stat_io_histogram

On Fri, Jan 30, 2026 at 2:43 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
[..]

I'm attaching v3 which has now default switched to __builtin_clzl() which
works ok for uint64 (not sure if I need to care about __builtin_clzll
on Windows?).

Here comes the v4:

1. Rebased just in case.
2. Earlier appears to the uncomplete patch without local changes (comment
mentioned use of __builtin_clzl, but actually code called
__builtin_clz -- 32-bit one not long one), fixed that with the new
version.
3. I've added discovery of __builtin_clzl into autoconf/meson as it was
missing (although comment there says "We assume that we needn't test
all widths of these explicitly:", but isn't it safer we test explicitly
what we use?
4. And then I've spotted that pg_leftmost_one_pos64() in pg_binutils.h uses on
master the __builtin_clzl already, so I've tweaked it to use check
HAVE__BUILTIN_CLZL (not CLZ) too once we have that now.

Open questions:
0. Should I pursue more benchmarking or the above results are enough?
1. Should I add per-PID backend stats too or skip that to avoid causing
potential further overhead? (probably yet another memcpy...)
2. Shouldn't we fix that mdsyncfiletag() mentioned earlier we seem to have
pgstat_count_io_op_time() *after* potential FileClose() (as per my
earlier question)

-J.

jakub.wartak@enterprisedb.com

3 months ago

In reply to: Jakub Wartak (#7)

Re: pg_stat_io_histogram

On Thu, Feb 5, 2026 at 1:13 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

On Fri, Jan 30, 2026 at 2:43 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
[..]

I'm attaching v3 which has now default switched to __builtin_clzl() which
works ok for uint64 (not sure if I need to care about __builtin_clzll
on Windows?).

Here comes the v4:

1. Rebased just in case.
2. Earlier appears to the uncomplete patch without local changes (comment
mentioned use of __builtin_clzl, but actually code called
__builtin_clz -- 32-bit one not long one), fixed that with the new
version.
3. I've added discovery of __builtin_clzl into autoconf/meson as it was
missing (although comment there says "We assume that we needn't test
all widths of these explicitly:", but isn't it safer we test explicitly
what we use?
4. And then I've spotted that pg_leftmost_one_pos64() in pg_binutils.h uses on
master the __builtin_clzl already, so I've tweaked it to use check
HAVE__BUILTIN_CLZL (not CLZ) too once we have that now.

Open questions:
0. Should I pursue more benchmarking or the above results are enough?
1. Should I add per-PID backend stats too or skip that to avoid causing
potential further overhead? (probably yet another memcpy...)
2. Shouldn't we fix that mdsyncfiletag() mentioned earlier we seem to have
pgstat_count_io_op_time() *after* potential FileClose() (as per my
earlier question)

Hi all,

I would be grateful for any feedback. Here comes v5 attached with:
1. finish documentation for I/O operation types (io_type column) and s/hit?/hit/
2. fix documentation typos, references, copy paste errors..
3. fix code comments typos
4. added missing return for pg_leading_zero_bits64() in case of lack
of HAVE__BUILTIN_CLZL
(discovered thanks to clang's -Wreturn-type)

There are still two implementations inside for get_bucket_index()
I think we should stick to the clz one as it appears to be faster.
Above questions since v4 remain.

-J.

ants.aasma@cybertec.at

3 months ago

In reply to: Jakub Wartak (#3)

Re: pg_stat_io_histogram

On Tue, 27 Jan 2026 at 14:06, Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
NVMe disk you'll get < 10us, after all.

I was blind and concentrated way too much on the bad-behaving I/O rather than good
I/O - let's call it I/O negativity bias 8)

Now v2 contains the min bucket lowered to 8us (but max then is just ~131ms, I
didn't want it to use more than 64b total, 16*4b (uint32)=64b and well
16*8b(uint64)=128b already, so that's why it's capped max at 131072us right now).

I have toyed around with similar histogram implementations as I have
dealt with multiple cases where having a latency histogram would have
made diagnosis much faster. So thank you for working on this.

I think it would be useful to have a max higher than 131ms. I've seen
some cases with buggy multipathing driver and self-DDOS'ing networking
hardware where the problem latencies have been in the 20s - 60s range.
Being able to attribute the whole time to I/O allows quickly ruling
out other problems. Seeing a count in 131ms+ bucket is a strong hint,
seeing a count in 34s-68s bucket is a smoking gun.

Is the main concern for limiting the range cache-misses/pollution when
counting I/O or is it memory overhead and cost of collecting?

It seems quite wasteful to replicate the histogram 240x for each
object/context/op combination. I don't think it matters for I/O
instrumentation overhead - each backend is only doing a limited amount
of different I/O categories and the lower buckets are likely to be on
the same cache line with the counter that gets touched anyway. For
higher buckets the overhead should be negligible compared to the cost
of the I/O itself.

What I'm worried about is that this increases PgStat_PendingIO from
5.6KB to 30KB. This whole chunk of memory needs to be scanned and
added to shared memory structures element by element. Compiler auto
vectorization doesn't seem to kick in on pgstat_io_flush_cb(), but
even then scanning an extra 25KB of mostly zeroes on every commit
doesn't seem great. Maybe making the histogram accumulation
conditional on the counter field being non-zero is enough to avoid any
issues? I haven't yet constructed a benchmark to see if it's actually
a problem or not. Select only pgbench with small shared buffers and
scale that fits into page cache should be an adversarial use case
while still being reasonably realistic.

I'm not familiar enough with the new stats infrastructure to tell
whether it's a problem, but it seems odd that
pgstat_flush_backend_entry_io() isn't modified to aggregate the
histograms.

Regards,
Ants Aasma

#10

[1]: /messages/by-id/CAKZiRmyLKeh9thmHNbkD7KSy3fsoUeopNVEGH33na8dXS9kN2g@mail.gmail.com

jakub.wartak@enterprisedb.com

3 months ago

In reply to: Ants Aasma (#9)

Re: pg_stat_io_histogram

On Wed, Feb 11, 2026 at 1:42 PM Ants Aasma <ants.aasma@cybertec.at> wrote:

Hi Ants, thanks for taking time to respond!

On Tue, 27 Jan 2026 at 14:06, Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
NVMe disk you'll get < 10us, after all.

I was blind and concentrated way too much on the bad-behaving I/O rather than good
I/O - let's call it I/O negativity bias 8)

Now v2 contains the min bucket lowered to 8us (but max then is just ~131ms, I
didn't want it to use more than 64b total, 16*4b (uint32)=64b and well
16*8b(uint64)=128b already, so that's why it's capped max at 131072us right now).

I have toyed around with similar histogram implementations as I have
dealt with multiple cases where having a latency histogram would have
made diagnosis much faster. So thank you for working on this.

Awesome! (I mean sorry you had to deal with terrible I/O stack
implementations.. ;))

I think it would be useful to have a max higher than 131ms. I've seen
some cases with buggy multipathing driver and self-DDOS'ing networking
hardware where the problem latencies have been in the 20s - 60s range.
Being able to attribute the whole time to I/O allows quickly ruling
out other problems. Seeing a count in 131ms+ bucket is a strong hint,
seeing a count in 34s-68s bucket is a smoking gun.

Is the main concern for limiting the range cache-misses/pollution when
counting I/O or is it memory overhead and cost of collecting?

Yes, I fully agree, but the primary reason for developing is finding those
edge case outliers (p99.9) that cause issues, but as You say I'm completely
not sure of how much data we can gather there before it starts to be
noticeable OR just makes committers uncomfortable due to performance concerns
(even if not demonstrated by benchmarks).

It seems quite wasteful to replicate the histogram 240x for each
object/context/op combination. I don't think it matters for I/O
instrumentation overhead - each backend is only doing a limited amount
of different I/O categories and the lower buckets are likely to be on
the same cache line with the counter that gets touched anyway. For
higher buckets the overhead should be negligible compared to the cost
of the I/O itself.
What I'm worried about is that this increases PgStat_PendingIO from
5.6KB to 30KB. This whole chunk of memory needs to be scanned and
added to shared memory structures element by element. Compiler auto
vectorization doesn't seem to kick in on pgstat_io_flush_cb(), but

Right, after putting "#pragma clang loop vectorize(enable)") clang reports:
../src/backend/utils/activity/pgstat_io.c:273:2: warning: loop not vectorized:
the optimizer was unable to perform the requested transformation;
the transformation might be disabled or specified as part of an unsupported
transformation ordering [-Wpass-failed=transform-warning]
273 | for (int io_object = 0; io_object <
IOOBJECT_NUM_TYPES; io_object++)

BTW how have you arrived with the "240x" number? We have 16 buckets for each
of the object/context/type.

even then scanning an extra 25KB of mostly zeroes on every commit
doesn't seem great. Maybe making the histogram accumulation
conditional on the counter field being non-zero is enough to avoid any
issues? I haven't yet constructed a benchmark to see if it's actually
a problem or not. Select only pgbench with small shared buffers and
scale that fits into page cache should be an adversarial use case
while still being reasonably realistic.

Earlier I've done some benchmarks (please see [1]/messages/by-id/CAKZiRmyLKeh9thmHNbkD7KSy3fsoUeopNVEGH33na8dXS9kN2g@mail.gmail.com) based on recommendations
by Andres to keep low io_combine_limit for that and just tiny shared_buffers.
I'm getting too much noise to derive any results, and as this is related
to I/O even probably context switches start playing a role there... sadly we
seem not to have a performance farm to answer this.

TBH, I'm not sure how to progress with this, I mean we could as you say:
- reduce PgStat_PendingIO.pending_hist_time_buckets by removing
IOCONTEXT_NUM_TYPES
(not a big loss, just lack of showing BAS strategy)
- we could even further reduce PgStat_PendingIO.pending_hist_time_buckets
by removing IOOBJECT_NUM_TYPES, but those are just 3 and they might be
useful

... and are You saying to try to do this below thing too?

@@ -288,8 +290,9 @@ pgstat_io_flush_cb(bool nowait)
for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
-
bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
-
PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
+
if(PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b]

0)

+
bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+
PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];

.. but the main problem, even if I do all of that I won't be able to
reliably measure the impact, probably the best I could say is
"runs good as well as master, +/- 3%".

Could you somehow help me with this? I mean should we reduce the scope(remove
context) and add that "if"?

I'm not familiar enough with the new stats infrastructure to tell
whether it's a problem, but it seems odd that
pgstat_flush_backend_entry_io() isn't modified to aggregate the
histograms.

Well I'm first time doing this too, and my understanding is that
pgstat_io.c::pgstat_io_flush_cb() is flushing the global statistics
(per backend-type) while the per-individual backend
pgstat_flush_backend_entry_io() (from pgstat_backend.c) is more about
per-PID-backends stats (--> for: select * from pg_stat_get_backend_io(PID)).

In terms of this patch, the per-backend-PID-I/O histograms are not implemented
yet, and I've raised this question earlier, but I'm starting to believe
the answer is probably no, we should not implement those (more overhead
for no apparent benefit, as most of the cases could be tracked down just with
this overall per-backend-type stats ).

Please feel free to drop some code, I'm looking for Co-authors on this for sure.

-J.

#11

ants.aasma@cybertec.at

3 months ago

In reply to: Jakub Wartak (#10)

Re: pg_stat_io_histogram

On Thu, 12 Feb 2026 at 10:35, Jakub Wartak <jakub.wartak@enterprisedb.com>
wrote:

BTW how have you arrived with the "240x" number? We have 16 buckets for
each
of the object/context/type.

Sorry, I worded that poorly. I meant that we store a histogram for each
combination, 240 in total.

even then scanning an extra 25KB of mostly zeroes on every commit
doesn't seem great. Maybe making the histogram accumulation
conditional on the counter field being non-zero is enough to avoid any
issues? I haven't yet constructed a benchmark to see if it's actually
a problem or not. Select only pgbench with small shared buffers and
scale that fits into page cache should be an adversarial use case
while still being reasonably realistic.

Earlier I've done some benchmarks (please see [1]) based on recommendations
by Andres to keep low io_combine_limit for that and just tiny
shared_buffers.
I'm getting too much noise to derive any results, and as this is related
to I/O even probably context switches start playing a role there... sadly
we
seem not to have a performance farm to answer this.

I glossed over the first benchmark you did. That's pretty close to what I
was talking about - exercise the stats collection part by having ~1 I/O
served from page cache per a trivial transaction. And the prewarm should
exercise the per I/O overhead. If neither of them have any measurable
overhead then I can't think of a workload where it could be worse.

TBH, I'm not sure how to progress with this, I mean we could as you say:

- reduce PgStat_PendingIO.pending_hist_time_buckets by removing
IOCONTEXT_NUM_TYPES
(not a big loss, just lack of showing BAS strategy)

I'm on the fence on this. For the actual problems I've had to diagnose it
wouldn't have mattered. But latency differences of bulk vs. normal access
might be useful for understanding benchmark results better. A 5x reduction
in size is pretty big.

- we could even further reduce PgStat_PendingIO.pending_hist_time_buckets
by removing IOOBJECT_NUM_TYPES, but those are just 3 and they might be
useful

WAL write vs. relation write is a very useful distinction for me.

... and are You saying to try to do this below thing too?

@@ -288,8 +290,9 @@ pgstat_io_flush_cb(bool nowait)
for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS;
b++)
-
bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
-
PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
+

if(PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b]

0)

+
bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+
PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];

I meant this:

@@ -287,6 +287,7 @@ pgstat_io_flush_cb(bool nowait)

bktype_shstats->times[io_object][io_context][io_op] +=
INSTR_TIME_GET_MICROSEC(time);

+ if
(PendingIOStats.counts[io_object][io_context][io_op] > 0)
for(int b = 0; b <
PGSTAT_IO_HIST_BUCKETS; b++)

bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=

PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];

Most object/context/op combinations will have a 0 count, so no point in
actually looking at the histogram.

.. but the main problem, even if I do all of that I won't be able to
reliably measure the impact, probably the best I could say is
"runs good as well as master, +/- 3%".

Could you somehow help me with this? I mean should we reduce the
scope(remove
context) and add that "if"?

I think if we only aggregate histograms conditionally, then having a ton of
different histograms is less of a problem. Only the histograms that have
any data will get accessed. The overhead is limited to the memory usage
which I think is acceptable.

I'll run a few benchmarks on what I have available here to see if I can
tease out anything more than the no effect with a 3% error margin we have
today.

I'm not familiar enough with the new stats infrastructure to tell
whether it's a problem, but it seems odd that
pgstat_flush_backend_entry_io() isn't modified to aggregate the
histograms.

Well I'm first time doing this too, and my understanding is that
pgstat_io.c::pgstat_io_flush_cb() is flushing the global statistics
(per backend-type) while the per-individual backend
pgstat_flush_backend_entry_io() (from pgstat_backend.c) is more about
per-PID-backends stats (--> for: select * from
pg_stat_get_backend_io(PID)).

In terms of this patch, the per-backend-PID-I/O histograms are not
implemented
yet, and I've raised this question earlier, but I'm starting to believe
the answer is probably no, we should not implement those (more overhead
for no apparent benefit, as most of the cases could be tracked down just
with
this overall per-backend-type stats ).

I agree that per-PID histograms are probably not worth the extra work. But
this left me wondering if we are allocating the whole set of histograms too
many times. I don't think every place that uses PgStat_BktypeIO actually
needs the histograms. I will need to dig around to understand this code a
bit better.

Regards,
Ants Aasma

#12

ants.aasma@cybertec.at

2 months ago

In reply to: Ants Aasma (#11)

Re: pg_stat_io_histogram

On Thu, 12 Feb 2026 at 12:31, Ants Aasma <ants.aasma@cybertec.at> wrote:

I meant this:

@@ -287,6 +287,7 @@ pgstat_io_flush_cb(bool nowait)
bktype_shstats->times[io_object][io_context][io_op] +=
INSTR_TIME_GET_MICROSEC(time);

+ if (PendingIOStats.counts[io_object][io_context][io_op] > 0)
for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];

Most object/context/op combinations will have a 0 count, so no point in actually looking at the histogram.

For this to be of any use, the memset that comes after lock release
would also have to be adjusted to be conditional.

I was also able to convince clang and gcc to vectorize these loops. I
had to split the innermost loop so the time calculation with the /1000
for microseconds conversion and the conditional histogram loop are
done separately, mark all the loop with nounroll pragmas, and tag the
innermost loop for vectorization by clang. But looking at the
benchmark results below, probably not worth the effort.

.. but the main problem, even if I do all of that I won't be able to
reliably measure the impact, probably the best I could say is
"runs good as well as master, +/- 3%".

Could you somehow help me with this? I mean should we reduce the scope(remove
context) and add that "if"?

I think if we only aggregate histograms conditionally, then having a ton of different histograms is less of a problem. Only the histograms that have any data will get accessed. The overhead is limited to the memory usage which I think is acceptable.

I'll run a few benchmarks on what I have available here to see if I can tease out anything more than the no effect with a 3% error margin we have today.

I benchmarked with gcc 15.2 with "-O2 -march=x86-64
-fno-omit-frame-pointer" to match what most users have. The CPU is
Ryzen 9 9900X. I concentrated on two codepaths, pgstat_io_flush_cb and
pgstat_count_io_op_time. Configuration is default except the
following:

track_io_timing = 'on'
io_method = 'io_uring'
io_combine_limit = '1'
effective_io_concurrency = '1'

For checking if aggregating statistics has an effect I used pgbench
scale 100 in read only mode, 10 60s runs each. It will do 1-2 reads
from page cache per select.

bin | concurrency | avg | stddev_fraction | diff
-----------+-------------+----------+-----------------+---------
pg-iohist | 1 | 92927.2 | 0.00254 | 1.00735
pg-master | 1 | 92248.9 | 0.00214 |
pg-iohist | 12 | 618342.3 | 0.00828 | 1.00035
pg-master | 12 | 618127.9 | 0.00819 |
pg-iohist | 24 | 591228.1 | 0.00889 | 0.98858
pg-master | 24 | 598058.5 | 0.00846 |

perf measurement shows 0.00% samples in pgstat_io_flush_cb, and 0.07%
in pgstat_count_io_op_time. After checking the logic in
pgstat_report_stat() these make sense - stats are aggregated at most
once per second per backend.

For the I/O collection, I tried using prewarm, but got really noisy
results from it. So instead I created a table with 100k rows with one
row per page, vacuumed it and benchmarked select count(*) over it.
Interestingly, setting effective_io_concurrency = 1 made the results
both more consistent and faster.

bin | avg | stddev_fraction | diff
-----------+-------+-----------------+---------
pg-iohist | 7.526 | 0.01012 | 0.99396
pg-master | 7.572 | 0.01186 |

perf measurement shows 0.40% spent in pgstat_count_io_op_time.

With -march=native build it goes up to 0.62%, mostly thanks to tps
going up by 30%. Checksum calculations really love AVX-512.

I think performance wise the patch is fine as is, there is negligible
performance overhead even in most adverse conditions.

I still want to look at the memory overhead more closely. The 30kB per
backend seems tolerable to me, but I think having it in
PgStat_BktypeIO is not great. This makes PgStat_IO
30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot be
half a megabyte bigger for no reason seems too wasteful.

Regards,
Ants Aasma

#13

[1]: /messages/by-id/aNVWe2tR1jj5Tsct@ip-10-97-1-34.eu-west-3.compute.internal

andres@anarazel.de

2 months ago

In reply to: Ants Aasma (#12)

Re: pg_stat_io_histogram

Hi,

On 2026-02-18 19:37:16 +0200, Ants Aasma wrote:

I was also able to convince clang and gcc to vectorize these loops. I
had to split the innermost loop so the time calculation with the /1000
for microseconds conversion and the conditional histogram loop are
done separately, mark all the loop with nounroll pragmas, and tag the
innermost loop for vectorization by clang. But looking at the
benchmark results below, probably not worth the effort.

I don't think the CPU efficiency of flushing stats should ever matter to the
degree that vectorizing should be a goal. If it does, we are either keeping
way too many stats or are flushing way way too often.

What matters is to reduce the overhead when doing process local accounting, as
that will typically happen many orders of magnitude more frequently than
flushing / merging stats.

For the I/O collection, I tried using prewarm, but got really noisy
results from it. So instead I created a table with 100k rows with one
row per page, vacuumed it and benchmarked select count(*) over it.
Interestingly, setting effective_io_concurrency = 1 made the results
both more consistent and faster.

It's an aside, but anyway: There's really not a whole lot of benefit of doing
AIO when the data is in the page cache. It can only accellerate things if
either checksum computations or memory bandwidth is a limiting factor, as with
worker mode both can be parallelized. I don't think the checksum computation
commonly is a bottleneck with proper compiler optimization. While memory
bandwith can be a major bottleneck on Intel server architectures, I haven't
seen that on AMD.

I'd probably, just out of paranoia, also test without checksums enabled (to
avoid the memory bandwidth hit) and see if the overhead increases if you
change the query to not need to evaluate expressions (e.g. by using SELECT *
FROM tbl OFFSET large_number, or using pg_prewarm with
maintenance_io_concurrency=1).

One thing to be aware of is that with the rdtsc[p] patch (to substantially
reduce timing overhead), it'll become a tad more expensive to convert an
instr_time to nanoseconds (due to having to convert cycles to nanoseconds).
It may be worth testing the combination.

On that note, why is this measuring things in nanoseconds, given that we
already conver instr_time to microseconds nearby and that its quite unlikely
that you'd ever have IO times below a microsecond and that
MIN_PG_STAT_IO_HIST_LATENCY already is in the microsecond domain and we
display it as microseconds?

Just rediscovered that the per-backend tracking patch added an external
function call to pgstat_count_io_op_time(), pgstat_count_backend_io_op() and
that a fair number of more recently added branches are constants at the
callsite :(. Probably doesn't matter, but makes me sad nonetheless :)

I still want to look at the memory overhead more closely. The 30kB per
backend seems tolerable to me

One thing worth thinking about here is that we probably could stand to
increase the number of IO types further, we e.g. have been talking about
tracking IO that bypasses shared buffers separately. And a few more context
types (e.g. index inner/leaf) could also make sense.

Without that change that'd be a somewhat moderate increase in memory usage,
but with this change it'd increase a lot more.

but I think having it in PgStat_BktypeIO is not great. This makes
PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot
be half a megabyte bigger for no reason seems too wasteful.

Yea, that's not awesome.

I guess we could count IO as 4 byte integers, and shift all bucket counts down
in the rare case of an on overflow. It's just a 2x improvement, but ...

I think we might need to reduce the number of buckets somewhat.

Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third for
16-32. I.e. the first bucket is the same width as the second. Is that
intentional?

Greetings,

Andres Freund

#14

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

2 months ago

In reply to: Andres Freund (#13)

Re: pg_stat_io_histogram

Hi,

On Wed, Feb 18, 2026 at 06:12:47PM -0500, Andres Freund wrote:

Just rediscovered that the per-backend tracking patch added an external
function call to pgstat_count_io_op_time(), pgstat_count_backend_io_op() and
that a fair number of more recently added branches are constants at the
callsite :(. Probably doesn't matter, but makes me sad nonetheless :)

Yeah, we have a dedicated thread to remove those [1]/messages/by-id/aNVWe2tR1jj5Tsct@ip-10-97-1-34.eu-west-3.compute.internal.

Sorry that my message is not directly linked to $SUBJECT, but I was reading this
thread and saw Andres's message above that remind me of [1]/messages/by-id/aNVWe2tR1jj5Tsct@ip-10-97-1-34.eu-west-3.compute.internal.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#15

ants.aasma@cybertec.at

2 months ago

In reply to: Andres Freund (#13)

Re: pg_stat_io_histogram

On Thu, 19 Feb 2026 at 01:12, Andres Freund <andres@anarazel.de> wrote:

I'd probably, just out of paranoia, also test without checksums enabled (to
avoid the memory bandwidth hit) and see if the overhead increases if you
change the query to not need to evaluate expressions (e.g. by using SELECT *
FROM tbl OFFSET large_number, or using pg_prewarm with
maintenance_io_concurrency=1).

Tried it, disabling checksums made the performance of march=x86-64
match march=native. I didn't run enough iterations to make any
statistically significant conclusions, but curiously perf now shows
only 0.23% in pgstat_count_io_op_time, compared to 0.60% before with
march=native. Probably less CPU cache thrashing going on.

One thing to be aware of is that with the rdtsc[p] patch (to substantially
reduce timing overhead), it'll become a tad more expensive to convert an
instr_time to nanoseconds (due to having to convert cycles to nanoseconds).
It may be worth testing the combination.

On that note, why is this measuring things in nanoseconds, given that we
already conver instr_time to microseconds nearby and that its quite unlikely
that you'd ever have IO times below a microsecond and that
MIN_PG_STAT_IO_HIST_LATENCY already is in the microsecond domain and we
display it as microseconds?

I agree that just using microseconds here would be better.

I still want to look at the memory overhead more closely. The 30kB per
backend seems tolerable to me

One thing worth thinking about here is that we probably could stand to
increase the number of IO types further, we e.g. have been talking about
tracking IO that bypasses shared buffers separately. And a few more context
types (e.g. index inner/leaf) could also make sense.

Without that change that'd be a somewhat moderate increase in memory usage,
but with this change it'd increase a lot more.

but I think having it in PgStat_BktypeIO is not great. This makes
PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot
be half a megabyte bigger for no reason seems too wasteful.

Yea, that's not awesome.

I guess we could count IO as 4 byte integers, and shift all bucket counts down
in the rare case of an on overflow. It's just a 2x improvement, but ...

I think we might need to reduce the number of buckets somewhat.

Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third for
16-32. I.e. the first bucket is the same width as the second. Is that
intentional?

If the boundaries are not on power-of-2 calculating the correct bucket
would take a bit longer.

For reducing the number of buckets one option is to use log base-4
buckets instead of base-2. But if we are worried about the size, then
reducing the number of histograms kept would be better. Many of the
combinations are not used at all, and for normal use being able to
distinguish latency profiles between so many different categories is
not that useful.

Regards,
Ants Aasma

#16

andres@anarazel.de

2 months ago

In reply to: Ants Aasma (#15)

Re: pg_stat_io_histogram

Hi,

On 2026-02-19 19:55:06 +0200, Ants Aasma wrote:

Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third for
16-32. I.e. the first bucket is the same width as the second. Is that
intentional?

If the boundaries are not on power-of-2 calculating the correct bucket
would take a bit longer.

Powers of two make sense, my point was that the lowest bucket and the next
smallest one are *not* sized in a powers of two fashion, unless I miss
something?

For reducing the number of buckets one option is to use log base-4 buckets
instead of base-2.

Yea, that could make sense, although it'd be somewhat sad to lose that much
precision.

But if we are worried about the size, then reducing the number of histograms
kept would be better.

I think we may want both.

Many of the combinations are not used at all

Yea, and for many of the operations we will never measure time and thus will
never have anything to fill the histogram with.

Perhaps we need to do something like have an array of histogram IDs and then a
smaller number of histograms without the same indexing. That implies more
indirection, but I think that may be acceptable - the overhead of reading a
page are high enough that it's probably fine, whereas a lot more indirection
for something like a buffer hit is a different story.

and for normal use being able to distinguish latency profiles between so
many different categories is not that useful.

I'm not that convinced by that though. It's pretty useful to separate out the
IO latency for something like vacuuming, COPY and normal use of a
relation. They will often have very different latency profiles.

Greetings,

Andres Freund

#17

jakub.wartak@enterprisedb.com

2 months ago

In reply to: Ants Aasma (#12)

Re: pg_stat_io_histogram

On Wed, Feb 18, 2026 at 6:37 PM Ants Aasma <ants.aasma@cybertec.at> wrote:

[..]

I think performance wise the patch is fine as is, there is negligible
performance overhead even in most adverse conditions.

Awesome, thank You very much Ants for those measurements! You had quite
already quite exchange with Andres on this topic, so I'll just respond
to some concern in follow-up emails.

-J.

#18

[1]: /messages/by-id/CAP53PkxNJ2Y6G8PEpQn1zKa6ODE6k1-oP9DNqWjkTj=dC8_KiA@mail.gmail.com
[2]: /messages/by-id/vhzkeogzrrfzjwo3xrnq4xsjh6i37ou6xsbz7yby3lbb3rnxzz@6fpysnkjyldi

jakub.wartak@enterprisedb.com

2 months ago

In reply to: Andres Freund (#13)

Re: pg_stat_io_histogram

On Thu, Feb 19, 2026 at 12:12 AM Andres Freund <andres@anarazel.de> wrote:

Hi Andres,

One thing to be aware of is that with the rdtsc[p] patch (to substantially
reduce timing overhead), it'll become a tad more expensive to convert an
instr_time to nanoseconds (due to having to convert cycles to nanoseconds).
It may be worth testing the combination.

I've took a quick look on latest v7-0002 from there [1]/messages/by-id/CAP53PkxNJ2Y6G8PEpQn1zKa6ODE6k1-oP9DNqWjkTj=dC8_KiA@mail.gmail.com and to sum up it does:

-#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)

+static inline int64
+pg_ticks_to_ns(int64 ticks)
+{
+#if defined(__x86_64__) || defined(WIN32)
[..]
+    ns += ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+    return ns;
+#else
+    return ticks;
+#endif
+}

[..but!]
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)

+#define INSTR_TIME_GET_NANOSEC(t) \
+    (pg_ticks_to_ns((t).ticks))
+

So at least to my eyes, it looks pretty cheap, doesn't it?

On that note, why is this measuring things in nanoseconds, given that we
already conver instr_time to microseconds nearby and that its quite unlikely
that you'd ever have IO times below a microsecond and that
MIN_PG_STAT_IO_HIST_LATENCY already is in the microsecond domain and we
display it as microseconds?

Hmm, in earlier reply You have recommened to get away from conversion from
microseconds so I've did because the microseconds were really costly
integer divisions [2]/messages/by-id/vhzkeogzrrfzjwo3xrnq4xsjh6i37ou6xsbz7yby3lbb3rnxzz@6fpysnkjyldi
"It's annoying to have to convert to microseconds here, that's not free :("

so because INSTR_TIME_GET_NANOSEC() is still cheap and fetching "ticks".

I still want to look at the memory overhead more closely. The 30kB per
backend seems tolerable to me

One thing worth thinking about here is that we probably could stand to
increase the number of IO types further, we e.g. have been talking about
tracking IO that bypasses shared buffers separately. And a few more context
types (e.g. index inner/leaf) could also make sense.

Without that change that'd be a somewhat moderate increase in memory usage,
but with this change it'd increase a lot more.

OK, point taken, it can grow even further, but..:

but I think having it in PgStat_BktypeIO is not great. This makes
PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot
be half a megabyte bigger for no reason seems too wasteful.

Yea, that's not awesome.

Guys, question, could You please explain me what are the drawbacks of having
this semi-big (internal-only) stat snapshot of 0.5MB? I'm struggling to
understand two things:
a) 0.5MB is not a lot those days (ok my 286 had 1MB in the day ;))
b) how does it affect anything, because testing show it's not?

My understandiung is that it only affects file size on startup/shutdown
in $PGDATADIR/pgstat/pgstat.stat, correct? My worry is that we introduce
more code (and bugs) for no real gain (?)

I guess we could count IO as 4 byte integers, and shift all bucket counts down
in the rare case of an on overflow. It's just a 2x improvement, but ...

[..I'll reply to that in next follow-up]

I think we might need to reduce the number of buckets somewhat.

I'm kind of skeptical on lowering bucket count, and even Ants wanted to
increase it, so that we would gain perfect visibility into sometimes
problematic hardware issues (I would also swear there is something magical
for I/Os stuck for 60secs), so we would both would want to cover it there,
but we cannot squeezee more due to performance concerns...

Now there's also this area where we want to understand was it from page
cache or some-fast-IO-dev and that's how I arrived at this first edge of
~8us. If we go one bucket further (that is make first bucket 16us), I was
afraid we may start loosing being able to differentiating page-cache vs
devices, won't we? (Optane seems to be gone, but it started @ ~20us? You said
in [3] that it could be even as low as 10? so I've thought 8 is good bet)

Right now, the final bucket is that we track >128ms (==> bad stuff),
and I would love to extend to that >512ms, but we cannot as it would be
more than 16 buckets (and 16*8bytes_due_to_uint64=128bytes already).

Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third for
16-32. I.e. the first bucket is the same width as the second. Is that
intentional?

Yes, it's intentional flat at the beggining to be able to differentiate
those fast accesses.

-J.

#19

jakub.wartak@enterprisedb.com

2 months ago

In reply to: Andres Freund (#16)

Re: pg_stat_io_histogram

On Thu, Feb 19, 2026 at 7:12 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2026-02-19 19:55:06 +0200, Ants Aasma wrote:

Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third for
16-32. I.e. the first bucket is the same width as the second. Is that
intentional?

If the boundaries are not on power-of-2 calculating the correct bucket
would take a bit longer.

Powers of two make sense, my point was that the lowest bucket and the next
smallest one are *not* sized in a powers of two fashion, unless I miss
something?

Yes, as stated earlier it's intentionally made flat at the beggining to be able
to differentiate those fast accesses.

For reducing the number of buckets one option is to use log base-4 buckets
instead of base-2.

Yea, that could make sense, although it'd be somewhat sad to lose that much
precision.

Same here, as stated earlier I wouldn't like to loose this precision.

But if we are worried about the size, then reducing the number of histograms
kept would be better.

I think we may want both.

+1.

Many of the combinations are not used at all

This!

Yea, and for many of the operations we will never measure time and thus will
never have anything to fill the histogram with.

Perhaps we need to do something like have an array of histogram IDs and then a
smaller number of histograms without the same indexing. That implies more
indirection, but I think that may be acceptable - the overhead of reading a
page are high enough that it's probably fine, whereas a lot more indirection
for something like a buffer hit is a different story.

OK so the previous options from the thread are:
a) we might use uint32 instead of uint64 and deal with overflows
b) we might filter some out of in order to save some memory. Trouble would be
which ones to eliminate... and would e.g. 2x saving be enough?
c) we leave it as it is (accept the simple code/optimal code and waste
this ~0.5MB
pgstat.stat)
d) the above - but I hardly understood how it would look like at all
e) eliminate some precision (via log4?) or column (like context/) - IMHO we
would waste too much precision or orginal goals with this.

So I'm kind of lost how to progress this, because now I - as previously stated -
I do not understand this challenge with memory saving and do now know the aim
or where to stop this optimization, thus I'm mostly +1 for "c", unless somebody
Enlighten me, please ;)

and for normal use being able to distinguish latency profiles between so
many different categories is not that useful.

I'm not that convinced by that though. It's pretty useful to separate out the
IO latency for something like vacuuming, COPY and normal use of a
relation. They will often have very different latency profiles.

Anyway, I'm attaching v6 - no serious changes, just cleaning:

1. Removed dead ifdefed code (finding most siginificant bits) as testing by Ants
showed that CLZ has literally zero overhead.
2. Rebased and fixed some missing include for ports/bits header for
pg_leading_zero_bits64(), dunno why it didnt complain earlier.
3. Added Ants as reviewer.
4. Fixed one comment refering to wrong function (nearby enum hist_io_stat_col).
5. Added one typedef to src/tools/pgindent/typedefs.list.

-J.

#20

jakub.wartak@enterprisedb.com

2 months ago

In reply to: Jakub Wartak (#19)

Re: pg_stat_io_histogram

On Mon, Feb 23, 2026 at 1:35 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

On Thu, Feb 19, 2026 at 7:12 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2026-02-19 19:55:06 +0200, Ants Aasma wrote:

Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third for
16-32. I.e. the first bucket is the same width as the second. Is that
intentional?

If the boundaries are not on power-of-2 calculating the correct bucket
would take a bit longer.

Powers of two make sense, my point was that the lowest bucket and the next
smallest one are *not* sized in a powers of two fashion, unless I miss
something?

Yes, as stated earlier it's intentionally made flat at the beggining to be able
to differentiate those fast accesses.

For reducing the number of buckets one option is to use log base-4 buckets
instead of base-2.

Yea, that could make sense, although it'd be somewhat sad to lose that much
precision.

Same here, as stated earlier I wouldn't like to loose this precision.

But if we are worried about the size, then reducing the number of histograms
kept would be better.

I think we may want both.

+1.

Many of the combinations are not used at all

This!

Yea, and for many of the operations we will never measure time and thus will
never have anything to fill the histogram with.

Perhaps we need to do something like have an array of histogram IDs and then a
smaller number of histograms without the same indexing. That implies more
indirection, but I think that may be acceptable - the overhead of reading a
page are high enough that it's probably fine, whereas a lot more indirection
for something like a buffer hit is a different story.

OK so the previous options from the thread are:
a) we might use uint32 instead of uint64 and deal with overflows
b) we might filter some out of in order to save some memory. Trouble would be
which ones to eliminate... and would e.g. 2x saving be enough?
c) we leave it as it is (accept the simple code/optimal code and waste
this ~0.5MB
pgstat.stat)
d) the above - but I hardly understood how it would look like at all
e) eliminate some precision (via log4?) or column (like context/) - IMHO we
would waste too much precision or orginal goals with this.

So I'm kind of lost how to progress this, because now I - as previously stated -
I do not understand this challenge with memory saving and do now know the aim
or where to stop this optimization, thus I'm mostly +1 for "c", unless somebody
Enlighten me, please ;)

and for normal use being able to distinguish latency profiles between so
many different categories is not that useful.

I'm not that convinced by that though. It's pretty useful to separate out the
IO latency for something like vacuuming, COPY and normal use of a
relation. They will often have very different latency profiles.

+1

--

Anyway, I'm attaching v6 - no serious changes, just cleaning:

1. Removed dead ifdefed code (finding most siginificant bits) as testing by Ants
showed that CLZ has literally zero overhead.
2. Rebased and fixed some missing include for ports/bits header for
pg_leading_zero_bits64(), dunno why it didnt complain earlier.
3. Added Ants as reviewer.
4. Fixed one comment refering to wrong function (nearby enum hist_io_stat_col).
5. Added one typedef to src/tools/pgindent/typedefs.list.

I think I have found another way how to minimize the weight of that memory
allocation simply remapping sparse backend type IDs to contiguous ones:

0. So the orginal patch weights like below according to pahole:

struct PgStat_BktypeIO {
[..]
uint64 hist_time_buckets[3][5][8][16]; /* 2880 15360 */
/* size: 18240, cachelines: 285, members: 4 */
};

struct PgStat_IO {
[..]
PgStat_BktypeIO stats[18]; /* 8 328320 */
/* size: 328328, cachelines: 5131, members: 2 */
/* last cacheline: 8 bytes */
};

so 320kB total and not 0.5MB for a start.

1. I've noticed that we were already skipping 4 out of 17 (~23%) of backend
types (thanks to pgstat_tracks_io_bktype()), and with simple array
condensation of backendtype (attached dirty PoC) I can get this down to:

struct PgStat_IO {
[..]
PgStat_BktypeIO stats[14]; /* 8 255360 */
/* size: 255368, cachelines: 3991, members: 2 */
/* last cacheline: 8 bytes */
};

so the attached crude patch is mainly about remapping using
pgstat_remap_condensed_bktype(). Patch needs lots of work, but
demonstrates a point.

2. We could slightly reduce even further if necessary, by also ignorning
B_AUTOVAC_LAUNCHER and B_STANDALONE_BACKEND for pg_stat_io. I mean those
seem to not generating any I/O and yet pgstat_tracks_io_bktype says
yes to them.

Thoughts? Is that a good direction? Would 1 or 2 be enough?

-J.

#21

andres@anarazel.de

2 months ago

In reply to: Jakub Wartak (#18)

#22