RFC: Allow EXPLAIN to Output Page Fault Information
Hi,
When reading the output of EXPLAIN (ANALYZE) to diagnose slow queries
for our customers, I often want to know how many page faults occurred,
especially major page faults, which involve disk access.
Currently, the BUFFERS option in EXPLAIN reports whether a page was
found in the shared buffers, but it doesn't show whether the page was
found in the OS cache or had to be read from disk.
Since page faults, especially major ones, hurt performance far more
than shared buffer or OS cache hits, it would be helpful to track
these events.
I have attached a PoC patch that modifies EXPLAIN to include page fault
information during both the planning and execution phases of a query.
The output would look like this:
=# EXPLAIN (ANALYZE, PAGEFAULTS)
SELECT * FROM pgbench_branches b
JOIN pgbench_accounts a ON b.bid = a.bid ORDER BY a.aid;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.58..335386.98 rows=4999917 width=197) (actual
time=3.785..5590.294 rows=5000000 loops=1)
... (omitted)
Planning:
Buffers: shared hit=50 read=48 dirtied=4 written=2
Page Faults: minor=30 major=19 <-ADDED
Planning Time: 22.080 ms
Execution:
Page Faults: minor=49 major=5 <-ADDED
Execution Time: 5794.356 ms
The patch has not yet added this functionality to auto_explain, but I
believe this feature would be more useful in auto_explain than in the
EXPLAIN command itself. I plan to add it to auto_explain if there's
interest in including page fault information in EXPLAIN.
While GUCs like log_statement_stats allow logging of getrusage(2)
information, including page faults, always enabling this can lead to
excessive log output. It would be better to log this data only when
queries are slow. Therefore, adding this feature to auto_explain seems
like a good solution.
The patch introduces a new option, PAGEFAULTS, but it may be more
appropriate to include the page fault information in another option,
such as SUMMARY, especially if there are other useful resources that can
be obtained from getrusage(2).
The patch cannot be applied to Windows because the getrusage()
implementation PostgreSQL ports to Windows currently only tracks CPU
times. I'm not sure whether major and minor page fault counts are
accessible on Windows, but it might be acceptable to treat this
similarly to log_statement_stats and exclude it from Windows support.
I also tried adding page fault information for each plan node, similar
to the BUFFERS option, but I decided against this approach due to the
performance impact: the frequent calls to getrusage(2), i.e. one per
row fetched, created significant overhead.
Any feedback would be greatly appreciated.
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Attachments:
v1-0001-PoC-Allow-EXPLAIN-to-output-page-fault-info.patch (text/x-diff, +133 -5)
torikoshia <torikoshia@oss.nttdata.com> writes:
I have attached a PoC patch that modifies EXPLAIN to include page fault
information during both the planning and execution phases of a query.
Surely these numbers would be too unstable to be worth anything.
regards, tom lane
On 2024-12-25 00:52, Tom Lane wrote:
torikoshia <torikoshia@oss.nttdata.com> writes:
I have attached a PoC patch that modifies EXPLAIN to include page fault
information during both the planning and execution phases of a query.
Surely these numbers would be too unstable to be worth anything.
Thanks for your comment!
Hmm, I didn't know these were unstable. If there are reasons, I'd
like to know about them.
I would like to explore alternative methods for measuring resource
usage, but I am not aware of other approaches.
(IIUC pg_stat_kcache[1], which is said to provide information about
filesystem layer I/O usage, also gets its metrics from getrusage(2))
[1]: https://github.com/powa-team/pg_stat_kcache
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
On Thu, 26 Dec 2024 13:15:59 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:
On 2024-12-25 00:52, Tom Lane wrote:
torikoshia <torikoshia@oss.nttdata.com> writes:
I have attached a PoC patch that modifies EXPLAIN to include page
fault information during both the planning and execution phases of a
query.
Surely these numbers would be too unstable to be worth anything.
Thanks for your comment!
Hmm, I didn't know these were unstable. If there are reasons, I'd
like to know about them.
I would like to explore alternative methods for measuring resource
usage, but I am not aware of other approaches.
(IIUC pg_stat_kcache[1], which is said to provide information about
filesystem layer I/O usage also gets metrics from getrusage(2))
What I'd really like to see is a column added to pg_stat_database
called blks_read_majflts
It would be great if we could calculate a cache hit ratio that took OS
major page faults into account
Yes, this could also be done in pg_stat_kcache, but why not do it right
in pg_stat_database? I think it would be pretty widely appreciated and
used.
-Jeremy
On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
torikoshia <torikoshia@oss.nttdata.com> writes:
I have attached a PoC patch that modifies EXPLAIN to include page fault
information during both the planning and execution phases of a query.
Surely these numbers would be too unstable to be worth anything.
What makes you think that? I'd expect them to be similarly stable to the
numbers we get for BUFFERS. Sure, they won't be completely stable,
but I expect them to be quite helpful when debugging perf issues,
because large numbers indicate that the query is disk-bound and small
numbers indicate that it is not.
These numbers seem especially useful for setups where shared_buffers is
significantly smaller than the total memory available to the system. In
those cases the output from BUFFERS might give the impression that
you're disk-bound, but if your working set still fits into the OS cache
then the number of page faults is likely still low, telling you that the
numbers you get back from BUFFERS are not as big of a problem as they
might seem.
On Fri, 27 Dec 2024 15:15:40 +0100
"Jelte Fennema-Nio" <postgres@jeltef.nl> wrote:
On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
torikoshia <torikoshia@oss.nttdata.com> writes:
I have attached a PoC patch that modifies EXPLAIN to include page
fault information during both the planning and execution phases of
a query.
Surely these numbers would be too unstable to be worth anything.
What makes you think that? I'd expect them to be similarly stable to
the numbers we get for BUFFERS. Sure, they won't be completely
stable, but I expect them to be quite helpful when debugging perf
issues, because large numbers indicate that the query is disk-bound
and small numbers indicate that it is not.
These numbers seem especially useful for setups where shared_buffers
is significantly smaller than the total memory available to the
system. In those cases the output from BUFFERS might give the
impression that you're disk-bound, but if your working set still
fits into the OS cache then the number of page faults is likely still
low, telling you that the numbers you get back from BUFFERS
are not as big of a problem as they might seem.
We'd probably need to combine both pg_buffercache_evict() and
/proc/sys/vm/drop_caches to get stable numbers - which is something I
have done in the past for testing.
Another thought would be splitting out the IO timing information into
two values - IO timing for reads that triggered major faults, versus
IO timing for reads that did not.
And system views like pg_stat_database seem worth considering too.
-Jeremy
On Fri, Dec 27, 2024 at 03:15:40PM +0100, Jelte Fennema-Nio wrote:
On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
torikoshia <torikoshia@oss.nttdata.com> writes:
I have attached a PoC patch that modifies EXPLAIN to include page
fault information during both the planning and execution phases of a
query.
Surely these numbers would be too unstable to be worth anything.
What makes you think that? I'd expect them to be similarly stable to the
numbers we get for BUFFERS. Sure, they won't be completely stable,
but I expect them to be quite helpful when debugging perf issues,
because large numbers indicate that the query is disk-bound and small
numbers indicate that it is not.
These numbers seem especially useful for setups where shared_buffers is
significantly smaller than the total memory available to the system. In
those cases the output from BUFFERS might give the impression that
you're disk-bound, but if your working set still fits into the OS cache
then the number of page faults is likely still low, telling you that the
numbers you get back from BUFFERS are not as big of a problem as they
might seem.
I certainly would love to see storage I/O numbers as distinct from
kernel read I/O numbers.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
Bruce Momjian <bruce@momjian.us> writes:
On Fri, Dec 27, 2024 at 03:15:40PM +0100, Jelte Fennema-Nio wrote:
On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
torikoshia <torikoshia@oss.nttdata.com> writes:
I have attached a PoC patch that modifies EXPLAIN to include page
fault information during both the planning and execution phases of a
query.
I certainly would love to see storage I/O numbers as distinct from
kernel read I/O numbers.
Me too, but I think it is 100% wishful thinking to imagine that
page fault counts match up with that. Maybe there are filesystems
where a read that we request maps one-to-one with a subsequent
page fault, but it hardly seems likely to me that that's
universal. Also, you can't tell page faults for reading program
code apart from those for data, and you won't get any information
at all about writes.
regards, tom lane
On Mon Dec 30, 2024 at 5:39 PM CET, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
I certainly would love to see storage I/O numbers as distinct from
kernel read I/O numbers.
Me too, but I think it is 100% wishful thinking to imagine that
page fault counts match up with that.
Okay I played around with this patch a bit, in hopes of proving you
wrong. But I now agree with you. I cannot seem to get any numbers out of
this that make sense.
The major page fault numbers are always zero, even after running:
echo 1 > /proc/sys/vm/drop_caches
If Torikoshi has a way to get some more useful insights from this patch,
I'm quite interested in the steps he took (I might very well have missed
something obvious).
**However, I think the general direction has merit**: Changing this patch to
use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
is 0 when everything is in page cache, and it is very high when stuff is
not. I was only hacking around and basically did this:
s/ru_minflt/ru_inblock/g
s/ru_majflt/ru_oublock/g
Obviously more is needed. We'd probably want to show these numbers in
useful units like MB or something. Also, maybe there's some better way
of getting read/write numbers for the current process than
ru_inblock/ru_oublock (but this one seems to work at least reasonably
well).
One other thing that I noticed when playing around with this, which
would need to be addressed: Parallel workers need to pass these values
to the main process somehow, otherwise the IO from those processes gets lost.
For the record, the queries I used to test this patch were:
create table t_big(a int, b text);
insert into t_big SELECT i, repeat(i::text, 200) FROM generate_series(1, 3000000) i;
explain (ANALYZE, PAGEFAULTS) select max(a), max(b) from t_big;
explain (analyze, PAGEFAULTS) insert into t_big SELECT i, repeat(i::text, 200) FROM generate_series(1, 3000000) i;
And then seeing if there was any difference in the explain analyze
output after running the following (as root):
echo 1 > /proc/sys/vm/drop_caches
On Tue, Dec 31, 2024 at 1:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Bruce Momjian <bruce@momjian.us> writes:
I certainly would love to see storage I/O numbers as distinct from
kernel read I/O numbers.
Me too, but I think it is 100% wishful thinking to imagine that
page fault counts match up with that. Maybe there are filesystems
where a read that we request maps one-to-one with a subsequent
page fault, but it hardly seems likely to me that that's
universal. Also, you can't tell page faults for reading program
code apart from those for data, and you won't get any information
at all about writes.
Thanks for the explanation.
On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres@jeltef.nl>
wrote:
On Mon Dec 30, 2024 at 5:39 PM CET, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
I certainly would love to see storage I/O numbers as distinct from
kernel read I/O numbers.
Me too, but I think it is 100% wishful thinking to imagine that
page fault counts match up with that.
Okay I played around with this patch a bit, in hopes of proving you
wrong. But I now agree with you. I cannot seem to get any numbers out of
this that make sense.
The major page fault numbers are always zero, even after running:
echo 1 > /proc/sys/vm/drop_caches
If Torikoshi has a way to get some more useful insights from this patch,
I'm quite interested in the steps he took (I might very well have missed
something obvious).
Thanks for testing.
I also did pg_ctl restart to clear the buffer cache in addition to your
steps and saw many major faults again.
However, when I replaced the restart with pg_buffercache_evict(), I
again observed very few major faults.
I now feel that majflt from getrusage() is not an appropriate metric for
measuring storage I/O.
**However, I think the general direction has merit**: Changing this
patch to use `ru_inblock`/`ru_oublock` gives very useful insights.
`ru_inblock` is 0 when everything is in page cache, and it is very
high when stuff is not. I was only hacking around and basically did this:
s/ru_minflt/ru_inblock/g
s/ru_majflt/ru_oublock/g
Great!
I had misunderstood; I thought these metrics included page-cached I/O.
As far as I inspected, they come from read_bytes/write_bytes of
task_io_accounting, and the comments suggest they are what we want, i.e.
storage I/O:
--
/usr/src/linux-headers-5.15.0-127/include/linux/task_io_accounting.h
struct task_io_accounting {
..(snip)..
#ifdef CONFIG_TASK_IO_ACCOUNTING
	/*
	 * The number of bytes which this task has caused to be read from
	 * storage.
	 */
	u64 read_bytes;

	/*
	 * The number of bytes which this task has caused, or shall cause to
	 * be written to disk.
	 */
	u64 write_bytes;
Obviously more is needed. We'd probably want to show these numbers in
useful units like MB or something. Also, maybe there's some better way
of getting read/write numbers for the current process than
ru_inblock/ru_oublock (but this one seems to work at least reasonably
well).
Updated the PoC patch to report these numbers in KB:
=# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Seq Scan on pgbench_accounts (cost=0.00..263935.35 rows=10000035
width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
Buffers: shared hit=2587 read=161348
Planning Time: 0.367 ms
Execution:
Storage I/O: read=1291856 KB write=0 KB
Execution Time: 4353.253 ms
(6 rows)
Also, maybe there's some better way
of getting read/write numbers for the current process than
ru_inblock/ru_oublock (but this one seems to work at least reasonably
well).
Maybe, but as far as using getrusage(), ru_inblock and ru_oublock seem
the best.
One other thing that I noticed when playing around with this, which
would need to be addressed: Parallel workers need to pass these values
to the main process somehow, otherwise the IO from those processes gets
lost.
Yes.
I haven't developed it yet but I believe we can pass them like
buffer/WAL usage.
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Attachments:
v1-0001-PoC-Allow-EXPLAIN-to-output-storage-I-O-informati.patch (text/x-diff, +134 -3)
Hi,
On Mon, Jan 06, 2025 at 06:49:06PM +0900, torikoshia wrote:
**However, I think the general direction has merit**: Changing this
patch to use `ru_inblock`/`ru_oublock` gives very useful insights.
`ru_inblock` is 0 when everything is in page cache, and it is very
high when stuff is not. I was only hacking around and basically did this:
s/ru_minflt/ru_inblock/g
s/ru_majflt/ru_oublock/g
[...]
Also, maybe there's some better way
of getting read/write numbers for the current process than
ru_inblock/ru_oublock (but this one seems to work at least reasonably
well).
Maybe, but as far as using getrusage(), ru_inblock and ru_oublock seem
the best.
FWIW those are the counters we've been using in pg_stat_kcache / powa to
compute disk I/O and the "postgres shared buffers hit vs OS cache hit vs
disk read" hit ratio for many years, and we didn't get any problem
reports for that.
On Mon Jan 6, 2025 at 10:49 AM CET, torikoshia wrote:
On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres@jeltef.nl>
Updated the PoC patch to report these numbers in KB:
=# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Seq Scan on pgbench_accounts (cost=0.00..263935.35 rows=10000035
width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
Buffers: shared hit=2587 read=161348
Planning Time: 0.367 ms
Execution:
Storage I/O: read=1291856 KB write=0 KB
Execution Time: 4353.253 ms
(6 rows)
The core functionality works well in my opinion. I think it makes sense
to spend the effort to move this from PoC quality to something
committable. Below some of the things that are necessary to do that
after an initial pass over the code (and trying it out):
One other thing that I noticed when playing around with this, which
would need to be addressed: Parallel workers need to pass these values
to the main process somehow, otherwise the IO from those processes gets
lost.
Yes.
I haven't developed it yet but I believe we can pass them like
buffer/WAL usage.
1. Yeah, makes sense to do this the same way as we do for buffers. Let's do
this now.
+	if (es->storageio)
+	{
+		getrusage(RUSAGE_SELF, &rusage);
+
+		storageio.inblock = rusage.ru_inblock - storageio_start.inblock;
+		storageio.outblock = rusage.ru_oublock - storageio_start.outblock;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Execution:\n");
+			es->indent++;
+		}
+		show_storageio(es, &storageio);
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+			es->indent--;
+		ExplainCloseGroup("Execution", "Execution", true, es);
+	}
2. The current code always shows "Execution: " in the explain analyze
output, even if no storageio is done. I think this should use
peek_storageio() to check if any storageio was done and only show the
"Execution: " line if that is the case.
3. FORMAT JSON seems to be broken with this patch. With the following
simple query:
explain (ANALYZE, STORAGEIO, FORMAT JSON) select max(a), max(b) from t_big;
I get this assert failure:
TRAP: failed Assert("es->indent == 0"), File: "../src/backend/commands/explain.c", Line: 375, PID: 199034
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(ExceptionalCondition+0x74)[0x5ad72872b464]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(ExplainQuery+0x75b)[0x5ad7283c87bb]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(standard_ProcessUtility+0x595)[0x5ad7285e97f5]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4daadf)[0x5ad7285e7adf]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4dafc4)[0x5ad7285e7fc4]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PortalRun+0x32d)[0x5ad7285e834d]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4d70a2)[0x5ad7285e40a2]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PostgresMain+0x16e9)[0x5ad7285e5b39]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(BackendMain+0x5f)[0x5ad7285e02df]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(postmaster_child_launch+0xe1)[0x5ad72853cde1]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x433758)[0x5ad728540758]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PostmasterMain+0xddd)[0x5ad72854223d]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(main+0x1d0)[0x5ad72828b600]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x714aa222a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x714aa222a28b]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(_start+0x25)[0x5ad72828bc05]
FORMAT JSON should obviously not crash the server, but apart from
that it should also actually show this new data in the json output.
4. I think this setting should be enabled by default for ANALYZE, just
like BUFFERS is now since c2a4078e[1].
5. I'm wondering whether this option deserves its own EXPLAIN option, or
if it should simply be made part of BUFFERS.
6. Windows compilation is currently failing on the CFbot. Looking at the
output, that's because rusage does not contain these fields there. I
think you'll need some #ifdefs.
7. The result of getrusage() should be checked for errors and we should
report any error (even though it's very unlikely to ever fail).
8. This needs docs
+	appendStringInfo(es->str, " read=%ld KB", (long) usage->inblock / 2);
+	appendStringInfo(es->str, " write=%ld KB", (long) usage->outblock / 2);
9. I think this division by 2 could use some explanation in a comment. I
understand that you're doing this because Linux divides its original
byte counts by 512[2] and your additional factor of 2 gets that to
1024 bytes. But that's not immediately clear from the code.
I'm also not convinced that 512 is the blocksize or whether this logic is
even correct on every platform. I'm wondering if maybe we should
simply show the blocks after all.
[1]: https://github.com/postgres/postgres/commit/c2a4078ebad71999dd451ae7d4358be3c9290b07
[2]: https://github.com/torvalds/linux/blob/fbfd64d25c7af3b8695201ebc85efe90be28c5a3/include/linux/task_io_accounting_ops.h#L16-L23
On Mon, Jan 6, 2025 at 6:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
Hi,
On Mon, Jan 06, 2025 at 06:49:06PM +0900, torikoshia wrote:
**However, I think the general direction has merit**: Changing this
patch to use `ru_inblock`/`ru_oublock` gives very useful insights.
`ru_inblock` is 0 when everything is in page cache, and it is very
high when stuff is not. I was only hacking around and basically did this:
s/ru_minflt/ru_inblock/g
s/ru_majflt/ru_oublock/g
[...]
Also, maybe there's some better way
of getting read/write numbers for the current process than
ru_inblock/ru_oublock (but this one seems to work at least reasonably
well).
Maybe, but as far as using getrusage(), ru_inblock and ru_oublock seem
the best.
FWIW that's the counters we've been using in pg_stat_kcache / powa to
compute
disk IO and "postgres shared buffers hit VS OS cache hit VS disk read" hit
ratio for many years and we didn't get any problem report for that.
Thanks for sharing the information!
On Tue, Jan 7, 2025 at 5:09 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
On Mon Jan 6, 2025 at 10:49 AM CET, torikoshia wrote:
On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres@jeltef.nl>
Updated the PoC patch to report these numbers in KB:
=# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Seq Scan on pgbench_accounts (cost=0.00..263935.35 rows=10000035
width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
Buffers: shared hit=2587 read=161348
Planning Time: 0.367 ms
Execution:
Storage I/O: read=1291856 KB write=0 KB
Execution Time: 4353.253 ms
(6 rows)
The core functionality works well in my opinion. I think it makes sense
to spend the effort to move this from PoC quality to something
committable.
Thanks for the comment and review!
If there are no other opinions, I will make a patch based on the direction
of the current PoC patch.
4. I think this setting should be enabled by default for ANALYZE, just
like BUFFERS is now since c2a4078e[1].
5. I'm wondering whether this option deserves its own EXPLAIN option, or
if it should simply be made part of BUFFERS.
Although this is not information about PostgreSQL buffers, I now feel that
making this addition part of BUFFERS is attractive, since people using
BUFFERS would also sometimes want to know about the storage I/O.
Since BUFFERS is now ON by default with EXPLAIN ANALYZE, I am concerned
about the performance impact.
However, since the calls are limited to just two, one at the start and
one at the end, for the planning phase, the execution phase, and each
parallel worker, I believe the impact is relatively small.
9. I think this division by 2 could use some explanation in a comment. I
understand that you're doing this because Linux divides its original
byte counts by 512[2] and your additional factor of 2 gets that to
1024 bytes. But that's not immediately clear from the code.
I'm also not convinced that 512 is the blocksize or whether this logic is
even correct on every platform. I'm wondering if maybe we should
simply show the blocks after all.
Maybe so. I'll look into this and then decide the unit.
For the other comments, I plan to address them as you suggested.
Regards,
Atsushi Torikoshi
Hi,
On Tue, Jan 7, 2025 at 5:09 AM Jelte Fennema-Nio <postgres@jeltef.nl>
wrote:
The core functionality works well in my opinion. I think it makes sense
to spend the effort to move this from PoC quality to something
committable. Below some of the things that are necessary to do that
after an initial pass over the code (and trying it out):
Attached a Patch.
On Tue, Jan 7, 2025 at 10:42 PM Atsushi Torikoshi
<torikoshia.tech@gmail.com> wrote:
9. I think this division by 2 could use some explanation in a comment. I
understand that you're doing this because Linux divides its original
byte counts by 512[2] and your additional factor of 2 gets that to
1024 bytes. But that's not immediately clear from the code.
I'm also not convinced that 512 is the blocksize or whether this logic is
even correct on every platform. I'm wondering if maybe we should
simply show the blocks after all.
Maybe so. I'll look into this and then decide the unit.
I looked up the manuals for the following operating systems, as
documented in [1], and it seems that all of them except Windows support
getrusage(2) and return ru_inblock/ru_oublock:
Linux, Windows, FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, AIX,
Solaris, and illumos.
However, I'm unsure whether the unit of these values is consistently
512 bytes across all operating systems.
Additionally, I’m concerned that the timing of when these metrics are
incremented might vary between OSs.
For example, on Linux, it seems that ru_oublock is incremented when a
page is dirtied, as it's calculated by shifting write_bytes [2] right
by 9 bits (i.e. dividing by 512).
I'm not sure if other operating systems behave the same way.
That said, they all represent the number of storage I/O operations
performed or to be performed.
Therefore, I believe it would be reasonable to report the raw values
as-is, as they should still be useful for understanding storage I/O
activity.
Example output:
=# explain analyze select max(a), max(b) from t_big_ul;
(..snip..)
Planning:
Buffers: shared hit=31 read=54
Storage I/O: read=2744 times write=0 times
Planning Time: 30.685 ms
Execution:
Storage I/O: read=2563600 times write=0 times
Execution Time: 1685.272 ms
[1]: https://www.postgresql.org/docs/devel/supported-platforms.html
[2]: https://www.kernel.org/doc/Documentation/filesystems/proc.txt
--
Regards,
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Attachments:
v1-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch (text/x-diff, +298 -39)
On Mon, 27 Jan 2025 at 10:05, torikoshia <torikoshia@oss.nttdata.com> wrote:
Therefore, I believe it would be reasonable to report the raw values
as-is, as they should still be useful for understanding storage I/O
activity.
Sounds reasonable.
Below some feedback on the patch. It's all really minor. The patch
looks great. I'll play around with it a bit next week.
meta: it's confusing that this one is called v1 again, it would be
clearer if it was called v2.
nit: at line 528 the "if (es->buffers)" check can simply be merged
with the if block above (which does the exact same check)
	if (usage->inblock <= 0 && usage->outblock <= 0)
		return false;
	else
		return true;
nit: You can replace that if-else with:
return usage->inblock > 0 || usage->outblock > 0;
StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
Missing a function comment
On Sat, 8 Feb 2025 at 14:54, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
I'll play around with it a bit next week.
Okay, I played around with it and couldn't find any issues. I marked
the patch as "ready for committer" in the commitfest app[1], given
that all feedback in my previous email was very minor.
[1]: https://commitfest.postgresql.org/52/5526/
Hi,
On 2025-02-09 12:51:40 +0100, Jelte Fennema-Nio wrote:
On Sat, 8 Feb 2025 at 14:54, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
I'll play around with it a bit next week.
Okay, I played around with it and couldn't find any issues. I marked
the patch as "ready for committer" in the commitfest app[1], given
that all feedback in my previous email was very minor.
I'm somewhat against this patch, as it's fairly fundamentally incompatible
with AIO. There's no real way to get information in this manner if the IO
isn't executed synchronously in process context...
Greetings,
Andres
Andres Freund <andres@anarazel.de> writes:
I'm somewhat against this patch, as it's fairly fundamentally incompatible
with AIO. There's no real way to get information in this manner if the IO
isn't executed synchronously in process context...
Even without looking ahead to AIO, there's bgwriter, walwriter, and
checkpointer processes that all take I/O load away from foreground
processes. I don't really believe that this will produce useful
numbers.
regards, tom lane
On Sun, 9 Feb 2025 at 19:05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
I'm somewhat against this patch, as it's fairly fundamentally incompatible
with AIO. There's no real way to get information in this manner if the IO
isn't executed synchronously in process context...
Hmm, I had not considered how this would interact with your AIO work.
I agree that getting this info would be hard/impossible to do
efficiently, when IOs are done by background IO processes that
interleave IOs from different queries. But I'd expect that AIOs that
are done using io_uring would be tracked correctly without having to
change this code at all (because I assume those are done from the
query backend process).
One other thought: I think the primary benefit of this feature is
being able to see how many read IOs actually hit the disk, as opposed
to hitting OS page cache. That benefit disappears when using Direct
IO, because then there's no OS page cache.
How many years away do you think that widespread general use of
AIO+Direct IO is, though? I think that for the N years from now until
then, it would be very nice to have this feature to help debug query
performance problems. Then once the numbers become too
inaccurate/useless at some point, we could simply remove them again.
Even without looking ahead to AIO, there's bgwriter, walwriter, and
checkpointer processes that all take I/O load away from foreground
processes. I don't really believe that this will produce useful
numbers.
The bgwriter, walwriter, and checkpointer should only take away
*write* IOs. For read IOs the numbers should be very accurate and as
explained above read IOs is where I think the primary benefit of this
feature is.
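As a side note, the "read I/O that actually hit the disk" case corresponds to a major page fault. Below is a minimal sketch of provoking one, assuming Linux page-cache semantics and a hypothetical scratch path /tmp/fault_demo.bin; whether a major fault is actually observed depends on the kernel honoring POSIX_FADV_DONTNEED, so no particular count is guaranteed:

```python
import mmap
import os
import resource

def major_faults():
    """Major page faults (ones that required disk I/O) for this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_majflt

path = "/tmp/fault_demo.bin"  # hypothetical scratch file
with open(path, "wb") as f:
    f.write(os.urandom(8 * 1024 * 1024))
    f.flush()
    os.fsync(f.fileno())  # make the pages clean so the kernel may evict them

fd = os.open(path, os.O_RDONLY)
# Ask the kernel to drop this file's pages from the OS page cache, so the
# next access through mmap has to go back to disk (a major fault).
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)

before = major_faults()
with mmap.mmap(fd, 0, prot=mmap.PROT_READ) as m:
    touched = sum(m[i] for i in range(0, len(m), 4096))  # touch every page
after = major_faults()

os.close(fd)
os.remove(path)
print(f"major faults while reading: {after - before}")
```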
But even for write IOs I think the numbers would be useful when
looking at them with the goal of finding out why a particular query is
slow: If the bgwriter or checkpointer do the writes, then the query
should be roughly as fast as if no writes to the disk had taken place
at all, but if the query process does the writes then those writes are
probably blocking further execution of the query and thus slowing it
down.
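For reference, the per-process accounting a feature like this relies on is what getrusage(2) already exposes as ru_minflt/ru_majflt. A minimal sketch of sampling those counters around a piece of work, using Python's resource module (Unix only); the workload here is just touching freshly allocated memory, which typically incurs minor faults:

```python
import resource

def fault_counts():
    """Return (minor, major) page-fault counts for this process so far."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_minflt, ru.ru_majflt

before_min, before_maj = fault_counts()
# Touch freshly allocated memory; each newly backed page typically costs
# a minor fault (no disk access involved).
buf = bytearray(16 * 1024 * 1024)
after_min, after_maj = fault_counts()
print(f"minor={after_min - before_min} major={after_maj - before_maj}")
```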
On 2025-02-10 05:06, Jelte Fennema-Nio wrote:
Thanks for reviewing the patch and for the comments!
I've fixed the issues you pointed out and attached a v2 patch.
On Sun, 9 Feb 2025 at 19:05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
I'm somewhat against this patch, as it's fairly fundamentally incompatible
with AIO. There's no real way to get information in this manner if the IO
isn't executed synchronously in process context...

Hmm, I had not considered how this would interact with your AIO work.
I agree that getting this info would be hard/impossible to do
efficiently, when IOs are done by background IO processes that
interleave IOs from different queries. But I'd expect that AIOs that
are done using iouring would be tracked correctly without having to
change this code at all (because I assume those are done from the
query backend process).

One other thought: I think the primary benefit of this feature is
being able to see how many read IOs actually hit the disk, as opposed
to hitting OS page cache. That benefit disappears when using Direct
IO, because then there's no OS page cache.

How many years away do you think that widespread general use of
AIO+Direct IO is, though? I think that for the N years from now until
then, it would be very nice to have this feature to help debug query
performance problems. Then once the numbers become too
inaccurate/useless at some point, we could simply remove them again.
I haven't fully grasped the AIO effort yet, but Jelte's comments seem
reasonable to me.
Of course, as the one proposing this, I'm naturally biased toward
thinking it's beneficial.
What do you think?
Even without looking ahead to AIO, there's bgwriter, walwriter, and
checkpointer processes that all take I/O load away from foreground
processes. I don't really believe that this will produce useful
numbers.

The bgwriter, walwriter, and checkpointer should only take away
*write* IOs. For read IOs the numbers should be very accurate and as
explained above read IOs is where I think the primary benefit of this
feature is.

But even for write IOs I think the numbers would be useful when
looking at them with the goal of finding out why a particular query is
slow: If the bgwriter or checkpointer do the writes, then the query
should be roughly as fast as if no writes to the disk had taken place
at all, but if the query process does the writes then those writes are
probably blocking further execution of the query and thus slowing it
down.
I agree with this as well.
For example, for a SELECT query executed immediately after a large number
of INSERTs, we can observe writes to storage caused by setting hint bits
(and the WAL generated for them).
This makes the query take longer than it would if those writes did not
occur.
I think we can guess what is happening from the output:
postgres=# insert into t1 (select i, repeat('a', 1000) from
generate_series(1, 1000000) i);
INSERT 0 1000000
postgres=# explain analyze table t1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..382665.25 rows=21409025 width=36) (actual
time=1.926..11035.531 rows=1000100 loops=1)
Buffers: shared read=168575 dirtied=142858 written=142479
Planning:
Buffers: shared hit=3 read=3 written=1
Storage I/O: read=48 times write=16 times
Planning Time: 4.472 ms
Execution:
Storage I/O: read=2697272 times write=4480096 times // many writes
Execution Time: 11099.424 ms // slow
(9 rows)
postgres=# explain analyze table t1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..382665.25 rows=21409025 width=36) (actual
time=2.066..2926.394 rows=1000100 loops=1)
Buffers: shared read=168575 written=14
Planning Time: 0.295 ms
Execution:
Storage I/O: read=2697200 times write=224 times // few writes
Execution Time: 3016.257 ms // fast
(6 rows)
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.