Make pg_stat_io view count IOs as bytes instead of blocks
Hi,
Currently, in the pg_stat_io view, IOs are counted as blocks. However,
there are two issues with this approach:
1- The actual number of IO requests to the kernel is lower because IO
requests can be merged before sending the final request. Additionally, it
appears that all IOs are counted in block size.
2- Some IOs may not align with block size. For example, WAL read IOs are
done in variable bytes and it is not possible to correctly show these IOs
in the pg_stat_io view [1]/messages/by-id/CAN55FZ1ny+3kpdm5X3nGZ2Jp3wxZO-744eFgxktS6YQ3=OKR-A@mail.gmail.com.
To address this, I propose showing the total number of IO requests to the
kernel (as smgr function calls) and the total number of bytes in the IO. To
implement this change, the op_bytes column will be removed from the
pg_stat_io view. Instead, the [reads | writes | extends] columns will track
the total number of IO requests, and newly added [read | write |
extend]_bytes columns will track the total number of bytes in the IO.
Example benefit of this change:
Running query [2]CREATE TABLE t as select i, repeat('a', 600) as filler from generate_series(1, 10000000) as i; SELECT pg_stat_reset_shared('io'); SELECT * FROM t WHERE i = 0; SELECT backend_type, object, context, TRUNC((read_bytes / reads / (SELECT current_setting('block_size')::numeric)), 2) as avg_io_blocks FROM pg_stat_io WHERE reads > 0;, the result is:
╔═══════════════════╦══════════╦══════════╦═══════════════╗
║ backend_type ║ object ║ context ║ avg_io_blocks ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ client backend ║ relation ║ bulkread ║ 15.99 ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ background worker ║ relation ║ bulkread ║ 15.99 ║
╚═══════════════════╩══════════╩══════════╩═══════════════╝
You can rerun the same query [2]CREATE TABLE t as select i, repeat('a', 600) as filler from generate_series(1, 10000000) as i; SELECT pg_stat_reset_shared('io'); SELECT * FROM t WHERE i = 0; SELECT backend_type, object, context, TRUNC((read_bytes / reads / (SELECT current_setting('block_size')::numeric)), 2) as avg_io_blocks FROM pg_stat_io WHERE reads > 0; after setting io_combine_limit to 32 [3]SET io_combine_limit TO 32;.
The result is:
╔═══════════════════╦══════════╦══════════╦═══════════════╗
║ backend_type ║ object ║ context ║ avg_io_blocks ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ client backend ║ relation ║ bulkread ║ 31.70 ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ background worker ║ relation ║ bulkread ║ 31.60 ║
╚═══════════════════╩══════════╩══════════╩═══════════════╝
I believe that having visibility into avg_io_[bytes | blocks] is valuable
information that could help optimize Postgres.
Any feedback would be appreciated.
[1]: /messages/by-id/CAN55FZ1ny+3kpdm5X3nGZ2Jp3wxZO-744eFgxktS6YQ3=OKR-A@mail.gmail.com
/messages/by-id/CAN55FZ1ny+3kpdm5X3nGZ2Jp3wxZO-744eFgxktS6YQ3=OKR-A@mail.gmail.com
[2]: CREATE TABLE t as select i, repeat('a', 600) as filler from generate_series(1, 10000000) as i; SELECT pg_stat_reset_shared('io'); SELECT * FROM t WHERE i = 0; SELECT backend_type, object, context, TRUNC((read_bytes / reads / (SELECT current_setting('block_size')::numeric)), 2) as avg_io_blocks FROM pg_stat_io WHERE reads > 0;
CREATE TABLE t as select i, repeat('a', 600) as filler from
generate_series(1, 10000000) as i;
SELECT pg_stat_reset_shared('io');
SELECT * FROM t WHERE i = 0;
SELECT backend_type, object, context, TRUNC((read_bytes / reads / (SELECT
current_setting('block_size')::numeric)), 2) as avg_io_blocks FROM
pg_stat_io WHERE reads > 0;
[3]: SET io_combine_limit TO 32;
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v1-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchDownload+184-77
On Wed, Sep 11, 2024 at 7:19 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Currently, in the pg_stat_io view, IOs are counted as blocks. However, there are two issues with this approach:
1- The actual number of IO requests to the kernel is lower because IO requests can be merged before sending the final request. Additionally, it appears that all IOs are counted in block size.
I think this is a great idea. It will allow people to tune
io_combine_limit as you mention below.
2- Some IOs may not align with block size. For example, WAL read IOs are done in variable bytes and it is not possible to correctly show these IOs in the pg_stat_io view [1].
Yep, this makes a lot of sense as a solution.
To address this, I propose showing the total number of IO requests to the kernel (as smgr function calls) and the total number of bytes in the IO. To implement this change, the op_bytes column will be removed from the pg_stat_io view. Instead, the [reads | writes | extends] columns will track the total number of IO requests, and newly added [read | write | extend]_bytes columns will track the total number of bytes in the IO.
smgr API seems like the right place for this.
Example benefit of this change:
Running query [2], the result is:
╔═══════════════════╦══════════╦══════════╦═══════════════╗
║ backend_type ║ object ║ context ║ avg_io_blocks ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ client backend ║ relation ║ bulkread ║ 15.99 ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ background worker ║ relation ║ bulkread ║ 15.99 ║
╚═══════════════════╩══════════╩══════════╩═══════════════╝
I don't understand why background worker is listed here.
You can rerun the same query [2] after setting io_combine_limit to 32 [3]. The result is:
╔═══════════════════╦══════════╦══════════╦═══════════════╗
║ backend_type ║ object ║ context ║ avg_io_blocks ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ client backend ║ relation ║ bulkread ║ 31.70 ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ background worker ║ relation ║ bulkread ║ 31.60 ║
╚═══════════════════╩══════════╩══════════╩═══════════════╝I believe that having visibility into avg_io_[bytes | blocks] is valuable information that could help optimize Postgres.
In general, for this example, I think it would be more clear if you
compared what visibility we have in pg_stat_io on master with what
visibility we have with your patch.
I like that you show how io_combine_limit can be tuned using this, but
I don't think the problem statement is clear nor is the full
narrative.
CREATE TABLE t as select i, repeat('a', 600) as filler from generate_series(1, 10000000) as i;
SELECT pg_stat_reset_shared('io');
SELECT * FROM t WHERE i = 0;
SELECT backend_type, object, context, TRUNC((read_bytes / reads / (SELECT current_setting('block_size')::numeric)), 2) as avg_io_blocks FROM pg_stat_io WHERE reads > 0;
I like that you calculate the avg_io_blocks, but I think it is good to
show the raw columns as well.
- Melanie
Hi,
On Wed, Nov 27, 2024 at 11:08:01AM -0500, Melanie Plageman wrote:
On Wed, Sep 11, 2024 at 7:19 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Currently, in the pg_stat_io view, IOs are counted as blocks. However, there are two issues with this approach:
1- The actual number of IO requests to the kernel is lower because IO requests can be merged before sending the final request. Additionally, it appears that all IOs are counted in block size.
I think this is a great idea. It will allow people to tune
io_combine_limit as you mention below.2- Some IOs may not align with block size. For example, WAL read IOs are done in variable bytes and it is not possible to correctly show these IOs in the pg_stat_io view [1].
Yep, this makes a lot of sense as a solution.
Thanks for the patch! I also think it makes sense.
A few random comments:
=== 1
+ /*
+ * If IO done in bytes and byte is <= 0, this means there is an error
+ * while doing an IO. Don't count these IOs.
+ */
s/byte/bytes/?
also:
The pgstat_io_count_checks() parameter is uint64. Does it mean it has to be
changed to int64?
Also from what I can see the calls are done with those values:
- 0
- io_buffers_len * BLCKSZ
- extend_by * BLCKSZ
- BLCKSZ
could io_buffers_len and extend_by be < 0? If not, is the comment correct?
=== 2
+ Assert((io_op == IOOP_READ || io_op == IOOP_WRITE || io_op == IOOP_EXTEND
and
+ if ((io_op == IOOP_READ || io_op == IOOP_WRITE || io_op == IOOP_EXTEND) &&
What about ordering the enum in IOOp (no bytes/bytes) so that we could check
that io_op >= "our firt bytes enum" instead?
Also we could create a macro on top of that to make it clear. And a comment
would be needed around the IOOp definition.
I think that would be simpler to maintain should we add no bytes or bytes op in
the future.
=== 3
+pgstat_io_count_checks(IOObject io_object, IOContext io_context, IOOp io_op, uint64 bytes)
+{
+ Assert((unsigned int) io_object < IOOBJECT_NUM_TYPES);
+ Assert((unsigned int) io_context < IOCONTEXT_NUM_TYPES);
+ Assert((unsigned int) io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
IOObject and IOContext are passed only for the assertions. What about removing
them from there and put the asserts in other places?
=== 4
+ /* Only IOOP_READ, IOOP_WRITE and IOOP_EXTEND can do IO in bytes. */
Not sure about "can do IO in bytes" (same wording is used in multiple places).
=== 5
/* Convert to numeric. */
"convert to numeric"? to be consistent with others single line comments around.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
Thanks for looking into this!
On Wed, 27 Nov 2024 at 19:08, Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Wed, Sep 11, 2024 at 7:19 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Currently, in the pg_stat_io view, IOs are counted as blocks. However, there are two issues with this approach:
1- The actual number of IO requests to the kernel is lower because IO requests can be merged before sending the final request. Additionally, it appears that all IOs are counted in block size.
I think this is a great idea. It will allow people to tune
io_combine_limit as you mention below.2- Some IOs may not align with block size. For example, WAL read IOs are done in variable bytes and it is not possible to correctly show these IOs in the pg_stat_io view [1].
Yep, this makes a lot of sense as a solution.
To address this, I propose showing the total number of IO requests to the kernel (as smgr function calls) and the total number of bytes in the IO. To implement this change, the op_bytes column will be removed from the pg_stat_io view. Instead, the [reads | writes | extends] columns will track the total number of IO requests, and newly added [read | write | extend]_bytes columns will track the total number of bytes in the IO.
smgr API seems like the right place for this.
Example benefit of this change:
Running query [2], the result is:
╔═══════════════════╦══════════╦══════════╦═══════════════╗
║ backend_type ║ object ║ context ║ avg_io_blocks ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ client backend ║ relation ║ bulkread ║ 15.99 ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ background worker ║ relation ║ bulkread ║ 15.99 ║
╚═══════════════════╩══════════╩══════════╩═══════════════╝I don't understand why background worker is listed here.
Parallel sequential scan happens in this example and parallel workers
are listed as background workers. After setting
'max_parallel_workers_per_gather' to 0, it is gone.
You can rerun the same query [2] after setting io_combine_limit to 32 [3]. The result is:
╔═══════════════════╦══════════╦══════════╦═══════════════╗
║ backend_type ║ object ║ context ║ avg_io_blocks ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ client backend ║ relation ║ bulkread ║ 31.70 ║
╠═══════════════════╬══════════╬══════════╬═══════════════╣
║ background worker ║ relation ║ bulkread ║ 31.60 ║
╚═══════════════════╩══════════╩══════════╩═══════════════╝I believe that having visibility into avg_io_[bytes | blocks] is valuable information that could help optimize Postgres.
In general, for this example, I think it would be more clear if you
compared what visibility we have in pg_stat_io on master with what
visibility we have with your patch.
I am listing the changes as text, images are also attached.
* [reads | writes | extends] columns count the number of smgr function
calls now. They were counting the number of block IOs before.
* op_bytes column is removed from the view because each IO could have
a different size. They are not always equal to op_bytes.
* [read_bytes | write_bytes | extend_bytes] columns are added. These
columns count IO sizes as bytes.
There are two different IO cases:
1- Size of the IOs are constant:
* See 'client backend / bulkread' row, If you divide read_bytes
columns' value (6826754048) to BLCKSZ (8192) in the patched image, you
get the reads columns' value (833344) in the upstream image. So, we
actually do not lose any information when the size of the IOs are
constant.
2- Size of the IOs are different:
* Upstream version will give wrong information in this case. For
example see WALRead() function. pg_pread() is called with different
segbytes values in this function. It is not possible to correctly show
this stat in pg_stat_io view.
The problem with the upstream version of the pg_stat_io view is that
multiplying the number of blocks with the op_bytes does not always
give the total IO size. Also, it looks like Postgres is doing one IO
request per block. This patch tries to address these problems.
I like that you show how io_combine_limit can be tuned using this, but
I don't think the problem statement is clear nor is the full
narrative.
I just wanted to show one piece of information that can be gathered
with the patched version, it was not possible to gather that before.
CREATE TABLE t as select i, repeat('a', 600) as filler from generate_series(1, 10000000) as i;
SELECT pg_stat_reset_shared('io');
SELECT * FROM t WHERE i = 0;
SELECT backend_type, object, context, TRUNC((read_bytes / reads / (SELECT current_setting('block_size')::numeric)), 2) as avg_io_blocks FROM pg_stat_io WHERE reads > 0;I like that you calculate the avg_io_blocks, but I think it is good to
show the raw columns as well.
Images of the view after running the query [1]SET track_io_timing to ON; SET max_parallel_workers_per_gather TO 0; SELECT pg_stat_reset_shared('io'); CREATE TABLE t as select i, repeat('a', 600) as filler from generate_series(1, 10000000) as i; SELECT * FROM t WHERE i = 0; SELECT * FROM pg_stat_io; are attached.
P.S. I attached the images of the view because I do not know how they
will look if I copy paste them as text. If there is a way to add them
as text without distortion, please let me know.
[1]: SET track_io_timing to ON; SET max_parallel_workers_per_gather TO 0; SELECT pg_stat_reset_shared('io'); CREATE TABLE t as select i, repeat('a', 600) as filler from generate_series(1, 10000000) as i; SELECT * FROM t WHERE i = 0; SELECT * FROM pg_stat_io;
SET track_io_timing to ON;
SET max_parallel_workers_per_gather TO 0;
SELECT pg_stat_reset_shared('io');
CREATE TABLE t as select i, repeat('a', 600) as filler from
generate_series(1, 10000000) as i;
SELECT * FROM t WHERE i = 0;
SELECT * FROM pg_stat_io;
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
Thanks for looking into this!
On Thu, 28 Nov 2024 at 16:39, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Wed, Nov 27, 2024 at 11:08:01AM -0500, Melanie Plageman wrote:
On Wed, Sep 11, 2024 at 7:19 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Currently, in the pg_stat_io view, IOs are counted as blocks. However, there are two issues with this approach:
1- The actual number of IO requests to the kernel is lower because IO requests can be merged before sending the final request. Additionally, it appears that all IOs are counted in block size.
I think this is a great idea. It will allow people to tune
io_combine_limit as you mention below.2- Some IOs may not align with block size. For example, WAL read IOs are done in variable bytes and it is not possible to correctly show these IOs in the pg_stat_io view [1].
Yep, this makes a lot of sense as a solution.
Thanks for the patch! I also think it makes sense.
A few random comments:
=== 1
+ /* + * If IO done in bytes and byte is <= 0, this means there is an error + * while doing an IO. Don't count these IOs. + */s/byte/bytes/?
This is removed, please look below for the explanation.
also:
The pgstat_io_count_checks() parameter is uint64. Does it mean it has to be
changed to int64?Also from what I can see the calls are done with those values:
- 0
- io_buffers_len * BLCKSZ
- extend_by * BLCKSZ
- BLCKSZcould io_buffers_len and extend_by be < 0? If not, is the comment correct?
You are right, no need to have this check; it can not be less than 0.
I completely removed the function now.
=== 2
+ Assert((io_op == IOOP_READ || io_op == IOOP_WRITE || io_op == IOOP_EXTEND
and
+ if ((io_op == IOOP_READ || io_op == IOOP_WRITE || io_op == IOOP_EXTEND) &&
What about ordering the enum in IOOp (no bytes/bytes) so that we could check
that io_op >= "our firt bytes enum" instead?Also we could create a macro on top of that to make it clear. And a comment
would be needed around the IOOp definition.I think that would be simpler to maintain should we add no bytes or bytes op in
the future.
I think this is a good idea. I applied all the comments. Created an
inline function instead of macro and added this 'Assert((unsigned int)
io_object < IOOBJECT_NUM_TYPES);' to function.
=== 3
+pgstat_io_count_checks(IOObject io_object, IOContext io_context, IOOp io_op, uint64 bytes) +{ + Assert((unsigned int) io_object < IOOBJECT_NUM_TYPES); + Assert((unsigned int) io_context < IOCONTEXT_NUM_TYPES); + Assert((unsigned int) io_op < IOOP_NUM_TYPES); + Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));IOObject and IOContext are passed only for the assertions. What about removing
them from there and put the asserts in other places?
Done. I moved these checks to the pgstat_count_io_op_n() function. The
code looks more like its previous version now.
=== 4
+ /* Only IOOP_READ, IOOP_WRITE and IOOP_EXTEND can do IO in bytes. */
Not sure about "can do IO in bytes" (same wording is used in multiple places).
I changed it to 'measured in bytes' but I am not sure if this is
better, open to suggestions.
=== 5
/* Convert to numeric. */
"convert to numeric"? to be consistent with others single line comments around.
Done.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v2-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchDownload+167-77
Hi,
On Wed, Dec 04, 2024 at 02:49:11PM +0300, Nazir Bilal Yavuz wrote:
On Thu, 28 Nov 2024 at 16:39, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:You are right, no need to have this check; it can not be less than 0.
I completely removed the function now.
Thanks! Yeah I think that makes sense.
What about ordering the enum in IOOp (no bytes/bytes) so that we could check
that io_op >= "our firt bytes enum" instead?Also we could create a macro on top of that to make it clear. And a comment
would be needed around the IOOp definition.I think that would be simpler to maintain should we add no bytes or bytes op in
the future.I think this is a good idea. I applied all the comments.
Thanks!
+ * non-byte-measured and byte-measured operations. So, that makes is easier
+ * to check whether IO is measured in bytes.
s/that makes is/that makes it/
+ *
+ * Note: If this enum is modified, ensure that both `IOOP_NUM_TYPES` macro
+ * and the `ioop_measured_in_bytes` function are updated accordingly.
Yeah, or?
"Ensure IOOP_EXTEND is the first and IOOP_WRITE the last ones in the measured in
bytes" group and that the groups stay in that order.
That would make the "checks" local to the enum def should someone modify it.
Created an
inline function instead of macro and added this 'Assert((unsigned int)
io_object < IOOBJECT_NUM_TYPES);' to function.
An inline function seems the right choice for me too.
Done. I moved these checks to the pgstat_count_io_op_n() function. The
code looks more like its previous version now.
Thanks! Yeah, more easier to follow now.
Not sure about "can do IO in bytes" (same wording is used in multiple places).
I changed it to 'measured in bytes' but I am not sure if this is
better, open to suggestions.
I'm tempted to say something around "track", would mean things like:
1.
ioop_measured_in_bytes => is_ioop_tracked_in_bytes
2.
s/If an op does not do IO in bytes/If an op does not track bytes/
3.
s/not every IO measured in bytes/not every IO tracks bytes/
4.
s/non-byte-measured and byte-measured/non-tracking and tracking bytes/
Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Thu, 5 Dec 2024 at 12:13, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Wed, Dec 04, 2024 at 02:49:11PM +0300, Nazir Bilal Yavuz wrote:
On Thu, 28 Nov 2024 at 16:39, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:You are right, no need to have this check; it can not be less than 0.
I completely removed the function now.Thanks! Yeah I think that makes sense.
What about ordering the enum in IOOp (no bytes/bytes) so that we could check
that io_op >= "our firt bytes enum" instead?Also we could create a macro on top of that to make it clear. And a comment
would be needed around the IOOp definition.I think that would be simpler to maintain should we add no bytes or bytes op in
the future.I think this is a good idea. I applied all the comments.
Thanks!
+ * non-byte-measured and byte-measured operations. So, that makes is easier + * to check whether IO is measured in bytes.s/that makes is/that makes it/
Done.
+ * + * Note: If this enum is modified, ensure that both `IOOP_NUM_TYPES` macro + * and the `ioop_measured_in_bytes` function are updated accordingly.Yeah, or?
"Ensure IOOP_EXTEND is the first and IOOP_WRITE the last ones in the measured in
bytes" group and that the groups stay in that order.That would make the "checks" local to the enum def should someone modify it.
Makes sense, done.
Created an
inline function instead of macro and added this 'Assert((unsigned int)
io_object < IOOBJECT_NUM_TYPES);' to function.An inline function seems the right choice for me too.
Done. I moved these checks to the pgstat_count_io_op_n() function. The
code looks more like its previous version now.Thanks! Yeah, more easier to follow now.
Not sure about "can do IO in bytes" (same wording is used in multiple places).
I changed it to 'measured in bytes' but I am not sure if this is
better, open to suggestions.I'm tempted to say something around "track", would mean things like:
1.
ioop_measured_in_bytes => is_ioop_tracked_in_bytes
2.
s/If an op does not do IO in bytes/If an op does not track bytes/
3.
s/not every IO measured in bytes/not every IO tracks bytes/
4.
s/non-byte-measured and byte-measured/non-tracking and tracking bytes/
Thoughts?
Thanks! I think 'track' is a better word in this context. I used
'tracked in ...', as it sounded more correct to me (I hope it is).
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v3-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchDownload+167-77
On Fri, Dec 06, 2024 at 12:41:55PM +0300, Nazir Bilal Yavuz wrote:
Thanks! I think 'track' is a better word in this context. I used
'tracked in ...', as it sounded more correct to me (I hope it is).
Splitting op_bytes into three fields sounds like a good idea. Count
me in regarding the concept to depend less on BLCKSZ.
typedef enum IOOp
{
+ /* IOs not tracked in bytes */
IOOP_EVICT,
- IOOP_EXTEND,
IOOP_FSYNC,
IOOP_HIT,
- IOOP_READ,
IOOP_REUSE,
- IOOP_WRITE,
IOOP_WRITEBACK,
+
+ /* IOs tracked in bytes */
+ IOOP_EXTEND,
+ IOOP_READ,
+ IOOP_WRITE,
} IOOp;
pg_stat_io_build_tuples() is now the code path taken when building the
tuples returned by pg_stat_io, meaning that the new function called
pg_stat_get_backend_io() is also going to need an update in its list
of output parameters to accomodate with what you are changing here.
Calling your new pgstat_get_io_byte_index() in the new refactored
routine is also necessary. Sorry about that.
Could you send a rebase, please? I can promise to review this
thread's new patch, as that will also move the needle regarding your
work with pg_stat_io to track WAL activity.
--
Michael
Hi,
On Tue, 24 Dec 2024 at 09:13, Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Dec 06, 2024 at 12:41:55PM +0300, Nazir Bilal Yavuz wrote:
Thanks! I think 'track' is a better word in this context. I used
'tracked in ...', as it sounded more correct to me (I hope it is).Splitting op_bytes into three fields sounds like a good idea. Count
me in regarding the concept to depend less on BLCKSZ.typedef enum IOOp { + /* IOs not tracked in bytes */ IOOP_EVICT, - IOOP_EXTEND, IOOP_FSYNC, IOOP_HIT, - IOOP_READ, IOOP_REUSE, - IOOP_WRITE, IOOP_WRITEBACK, + + /* IOs tracked in bytes */ + IOOP_EXTEND, + IOOP_READ, + IOOP_WRITE, } IOOp;pg_stat_io_build_tuples() is now the code path taken when building the
tuples returned by pg_stat_io, meaning that the new function called
pg_stat_get_backend_io() is also going to need an update in its list
of output parameters to accomodate with what you are changing here.
Calling your new pgstat_get_io_byte_index() in the new refactored
routine is also necessary. Sorry about that.
No problem at all, thank you for reminding me!
Could you send a rebase, please? I can promise to review this
thread's new patch, as that will also move the needle regarding your
work with pg_stat_io to track WAL activity.
Thanks! v4 is attached. I quickly tested the pg_stat_get_backend_io()
function and it seems it is working.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v4-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchDownload+172-79
On Thu, Dec 26, 2024 at 02:41:26PM +0300, Nazir Bilal Yavuz wrote:
Thanks! v4 is attached. I quickly tested the pg_stat_get_backend_io()
function and it seems it is working.
Thanks a lot for the rebased version. This looks pretty solid. Here
are some comments.
void
-pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op, uint64 bytes)
{
- pgstat_count_io_op_n(io_object, io_context, io_op, 1);
+ pgstat_count_io_op_n(io_object, io_context, io_op, 1, bytes);
pgstat_count_io_op_n() is only used locally in pgstat_io.c. I'm OK to
keep it as it is used with the time calculations, but wouldn't it be
better to make it static in pgstat_io.c instead and not declare it in
pgstat.h? Do we really have a need for pgstat_count_io_op() at all at
the end or would it be better to change it so as it can handle a
number of operations given by the caller?
typedef struct PgStat_BktypeIO
{
+ uint64 bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
This wastes a bit of memory, while keeping the code simpler to
understand. That's better than more element number manipulations, so
I'm OK with what you have here.
+static inline bool
+is_ioop_tracked_in_bytes(IOOp io_op)
+{
+ Assert((unsigned int) io_op < IOOP_NUM_TYPES);
+ return io_op >= IOOP_EXTEND;
+}
This is only used in an assertion of pgstat_count_io_op_n() in
pgstat_io.c. Let's also keep it this routine local to the file. The
assert to make sure that the callers don't assign bytes to the
operations that don't support the counters is a good idea.
+ * If an IOOp does not tracked in bytes, IO_COL_INVALID is returned.
s/does not tracked/is not tracked/.
+/*
+ * Get the number of the column containing IO bytes for the specified IOOp.
+ * If an IOOp does not tracked in bytes, IO_COL_INVALID is returned.
+ */
+static io_stat_col
+pgstat_get_io_byte_index(IOOp io_op)
+{
Makes sense to me, and that's consistent with how the time attributes
are handled.
- /*
- * Hard-code this to the value of BLCKSZ for now. Future values
- * could include XLOG_BLCKSZ, once WAL IO is tracked, and constant
- * multipliers, once non-block-oriented IO (e.g. temporary file
- * IO) is tracked.
- */
- values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
Glad to see that gone.
+ <structfield>write_bytes</structfield> <type>bigint</type>
+ <structfield>extend_bytes</structfield> <type>bigint</type>
+ <structfield>read_bytes</structfield> <type>bigint</type>
These additions in the documentation are incorrect. All these
attributes are of type numeric, not bigint.
+ Number of units of size <symbol>BLCKSZ</symbol> which the process
It seems to me that this had better mention 8192 as the default.
--
Michael
Hi,
Thanks for the review!
On Thu, 9 Jan 2025 at 05:59, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Dec 26, 2024 at 02:41:26PM +0300, Nazir Bilal Yavuz wrote:
Thanks! v4 is attached. I quickly tested the pg_stat_get_backend_io()
function and it seems it is working.Thanks a lot for the rebased version. This looks pretty solid. Here
are some comments.void -pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op) +pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op, uint64 bytes) { - pgstat_count_io_op_n(io_object, io_context, io_op, 1); + pgstat_count_io_op_n(io_object, io_context, io_op, 1, bytes);pgstat_count_io_op_n() is only used locally in pgstat_io.c. I'm OK to
keep it as it is used with the time calculations, but wouldn't it be
better to make it static in pgstat_io.c instead and not declare it in
pgstat.h? Do we really have a need for pgstat_count_io_op() at all at
the end or would it be better to change it so as it can handle a
number of operations given by the caller?
I am a bit confused, are you suggesting these two alternatives:
1- Making pgstat_count_io_op_n() static and continuing to use
pgstat_count_io_op() as it is.
2- Removing pgstat_count_io_op() and instead using
pgstat_count_io_op_n() everywhere.
typedef struct PgStat_BktypeIO
{
+ uint64 bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];This wastes a bit of memory, while keeping the code simpler to
understand. That's better than more element number manipulations, so
I'm OK with what you have here.+static inline bool +is_ioop_tracked_in_bytes(IOOp io_op) +{ + Assert((unsigned int) io_op < IOOP_NUM_TYPES); + return io_op >= IOOP_EXTEND; +}This is only used in an assertion of pgstat_count_io_op_n() in
pgstat_io.c. Let's also keep it this routine local to the file. The
assert to make sure that the callers don't assign bytes to the
operations that don't support the counters is a good idea.
Makes sense, done.
+ * If an IOOp does not tracked in bytes, IO_COL_INVALID is returned.
s/does not tracked/is not tracked/.
Done.
+/* + * Get the number of the column containing IO bytes for the specified IOOp. + * If an IOOp does not tracked in bytes, IO_COL_INVALID is returned. + */ +static io_stat_col +pgstat_get_io_byte_index(IOOp io_op) +{Makes sense to me, and that's consistent with how the time attributes
are handled.- /*
- * Hard-code this to the value of BLCKSZ for now. Future values
- * could include XLOG_BLCKSZ, once WAL IO is tracked, and constant
- * multipliers, once non-block-oriented IO (e.g. temporary file
- * IO) is tracked.
- */
- values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);Glad to see that gone.
+ <structfield>write_bytes</structfield> <type>bigint</type> + <structfield>extend_bytes</structfield> <type>bigint</type> + <structfield>read_bytes</structfield> <type>bigint</type>These additions in the documentation are incorrect. All these
attributes are of type numeric, not bigint.
Done.
+ Number of units of size <symbol>BLCKSZ</symbol> which the process
It seems to me that this had better mention 8192 as the default.
I agree, done.
v5 is attached.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v5-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchDownload+172-80
Hi,
On Thu, 9 Jan 2025 at 10:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
On Thu, 9 Jan 2025 at 05:59, Michael Paquier <michael@paquier.xyz> wrote:
+static inline bool +is_ioop_tracked_in_bytes(IOOp io_op) +{ + Assert((unsigned int) io_op < IOOP_NUM_TYPES); + return io_op >= IOOP_EXTEND; +}This is only used in an assertion of pgstat_count_io_op_n() in
pgstat_io.c. Let's also keep it this routine local to the file. The
assert to make sure that the callers don't assign bytes to the
operations that don't support the counters is a good idea.Makes sense, done.
v5 is attached.
I missed a compilation error. Since the is_ioop_tracked_in_bytes() is
only used in the assert and the asserts are disabled in the production
builds, CompilerWarnings task gives -Wunused-function error:
pgstat_io.c:28:1: error: unused function 'is_ioop_tracked_in_bytes'
[-Werror,-Wunused-function]
pg_attribute_unused() is added to the function to silence that. v6 is attached.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v6-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchDownload+173-80
On Thu, Jan 09, 2025 at 10:15:20AM +0300, Nazir Bilal Yavuz wrote:
I am a bit confused, are you suggesting these two alternatives:
1- Making pgstat_count_io_op_n() static and continuing to use
pgstat_count_io_op() as it is.
2- Removing pgstat_count_io_op() and instead using
pgstat_count_io_op_n() everywhere.
Either of these options is OK by me. The current state of things just
seems a bit strange because we publish a routine that's used nowhere.
If you have plans for it in a different patch, that's also fine.
--
Michael
Hi,
On Thu, 9 Jan 2025 at 11:11, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jan 09, 2025 at 10:15:20AM +0300, Nazir Bilal Yavuz wrote:
I am a bit confused, are you suggesting these two alternatives:
1- Making pgstat_count_io_op_n() static and continuing to use
pgstat_count_io_op() as it is.
2- Removing pgstat_count_io_op() and instead using
pgstat_count_io_op_n() everywhere.Either of these options is OK by me. The current state of things just
seems a bit strange because we publish a routine that's used nowhere.
If you have plans for it in a different patch, that's also fine.
I followed the second option as it is similar to
pgstat_count_io_op_time() and also more future proof. I attached it as
another patch. v7 is attached.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v7-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchtext/x-patch; charset=US-ASCII; name=v7-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchDownload+173-80
v7-0002-Replace-pgstat_count_io_op-with-pgstat_count_io_o.patchtext/x-patch; charset=US-ASCII; name=v7-0002-Replace-pgstat_count_io_op-with-pgstat_count_io_o.patchDownload+4-12
Hi,
On Thu, Jan 09, 2025 at 02:20:16PM +0300, Nazir Bilal Yavuz wrote:
Hi,
On Thu, 9 Jan 2025 at 11:11, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jan 09, 2025 at 10:15:20AM +0300, Nazir Bilal Yavuz wrote:
I am a bit confused, are you suggesting these two alternatives:
1- Making pgstat_count_io_op_n() static and continuing to use
pgstat_count_io_op() as it is.
2- Removing pgstat_count_io_op() and instead using
pgstat_count_io_op_n() everywhere.Either of these options is OK by me. The current state of things just
seems a bit strange because we publish a routine that's used nowhere.
If you have plans for it in a different patch, that's also fine.I followed the second option as it is similar to
pgstat_count_io_op_time() and also more future proof. I attached it as
another patch. v7 is attached.
Thanks for the patches!
v7-0001:
+pg_attribute_unused()
+static inline bool
+is_ioop_tracked_in_bytes(IOOp io_op)
+{
+ Assert((unsigned int) io_op < IOOP_NUM_TYPES);
+ return io_op >= IOOP_EXTEND;
+}
and then
+ Assert(is_ioop_tracked_in_bytes(io_op) || bytes == 0);
We first use an Assert in is_ioop_tracked_in_bytes() and then we return
a value "just" to check another Assert. I wonder if it wouldn't make more sense
to get rid of this function and use a macro instead, something like?
#define is_ioop_tracked_in_bytes(io_op) \
((io_op) < IOOP_NUM_TYPES && (io_op) >= IOOP_EXTEND)
v7-0002:
I wonder if it wouldn't make more sense to remove pgstat_count_io_op() first
and then implement what currently is in v7-0001. What v7-0002 is removing is
not produced by v7-0001.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 09, 2025 at 03:30:37PM +0000, Bertrand Drouvot wrote:
We first use an Assert in is_ioop_tracked_in_bytes() and then we return
a value "just" to check another Assert. I wonder if it wouldn't make more sense
to get rid of this function and use a macro instead, something like?#define is_ioop_tracked_in_bytes(io_op) \
((io_op) < IOOP_NUM_TYPES && (io_op) >= IOOP_EXTEND)
Indeed. Your suggestion to use a macro makes more sense to me because
is_ioop_tracked_in_bytes() itself has an Assert(), while being itself
only used in an assertion, as you say. Better to document the
dependency on the ordering of IOOp, even if that's kind of hard to
miss.
v7-0002:
I wonder if it wouldn't make more sense to remove pgstat_count_io_op() first
and then implement what currently is in v7-0001. What v7-0002 is removing is
not produced by v7-0001.
This kind of cleanup should happen first, and it simplifies a bit the
reading of v7-0001 as we would just append a new argument to the
count_io_op() routine for the number of bytes. I was looking at that
and chose to stick with count_io_op() rather than count_io_op_n() as
we would have only one routine.
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_EXTEND,
- io_start, extend_by);
+ io_start, 1, extend_by * BLCKSZ);
[...]
pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND,
- io_start, extend_by);
+ io_start, 1, extend_by * BLCKSZ);
Something worth mentioning. I had a small doubt about this one for
temp and persistent relations, as it changes the extend count.
Anyway, I think that this is better: we are only doing one extend
batch, worth (extend_by * BLCKSZ).
Two times my fault today that the main patch does not apply anymore
(three times at least in total), so I have rebased it myself, and did
an extra review while on it, switching the code to use a macro. That
seems OK here. Please let me know if you have more comments.
(Still need to spend more time on the commit message, will do that
later).
--
Michael
Attachments:
v8-0001-Make-pg_stat_io-count-IOs-as-bytes-instead-of-blo.patchtext/x-diff; charset=us-asciiDownload+170-77
Hi,
On Fri, 10 Jan 2025 at 04:47, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jan 09, 2025 at 03:30:37PM +0000, Bertrand Drouvot wrote:
We first use an Assert in is_ioop_tracked_in_bytes() and then we return
a value "just" to check another Assert. I wonder if it wouldn't make more sense
to get rid of this function and use a macro instead, something like?#define is_ioop_tracked_in_bytes(io_op) \
((io_op) < IOOP_NUM_TYPES && (io_op) >= IOOP_EXTEND)Indeed. Your suggestion to use a macro makes more sense to me because
is_ioop_tracked_in_bytes() itself has an Assert(), while being itself
only used in an assertion, as you say. Better to document the
dependency on the ordering of IOOp, even if that's kind of hard to
miss.
I did not make it like this because now the
pgstat_is_ioop_tracked_in_bytes macro will return false when io_op >=
IOOP_NUM_TYPES. But I agree that having a macro has more benefits,
also there already is a check for the 'io_op < IOOP_NUM_TYPES' in the
pgstat_count_io_op() function.
v7-0002:
I wonder if it wouldn't make more sense to remove pgstat_count_io_op() first
and then implement what currently is in v7-0001. What v7-0002 is removing is
not produced by v7-0001.This kind of cleanup should happen first, and it simplifies a bit the
reading of v7-0001 as we would just append a new argument to the
count_io_op() routine for the number of bytes. I was looking at that
and chose to stick with count_io_op() rather than count_io_op_n() as
we would have only one routine.pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_EXTEND, - io_start, extend_by); + io_start, 1, extend_by * BLCKSZ); [...] pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND, - io_start, extend_by); + io_start, 1, extend_by * BLCKSZ);Something worth mentioning. I had a small doubt about this one for
temp and persistent relations, as it changes the extend count.
Anyway, I think that this is better: we are only doing one extend
batch, worth (extend_by * BLCKSZ).
I agree with you.
Two times my fault today that the main patch does not apply anymore
(three times at least in total), so I have rebased it myself, and did
an extra review while on it, switching the code to use a macro. That
seems OK here. Please let me know if you have more comments.
No worries, thank you for all of these!
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
On Fri, Jan 10, 2025 at 10:15:52AM +0300, Nazir Bilal Yavuz wrote:
Hi,
On Fri, 10 Jan 2025 at 04:47, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jan 09, 2025 at 03:30:37PM +0000, Bertrand Drouvot wrote:
We first use an Assert in is_ioop_tracked_in_bytes() and then we return
a value "just" to check another Assert. I wonder if it wouldn't make more sense
to get rid of this function and use a macro instead, something like?#define is_ioop_tracked_in_bytes(io_op) \
((io_op) < IOOP_NUM_TYPES && (io_op) >= IOOP_EXTEND)Indeed. Your suggestion to use a macro makes more sense to me because
is_ioop_tracked_in_bytes() itself has an Assert(), while being itself
only used in an assertion, as you say. Better to document the
dependency on the ordering of IOOp, even if that's kind of hard to
miss.But I agree that having a macro has more benefits,
also there already is a check for the 'io_op < IOOP_NUM_TYPES' in the
pgstat_count_io_op() function.
Yeah, I think we can remove the "io_op < IOOP_NUM_TYPE" assertion in
pgstat_count_io_op() (but keep this check as part of the macro).
That
seems OK here. Please let me know if you have more comments.
Appart from the above, LGTM.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Jan 10, 2025 at 08:23:46AM +0000, Bertrand Drouvot wrote:
On Fri, Jan 10, 2025 at 10:15:52AM +0300, Nazir Bilal Yavuz wrote:
But I agree that having a macro has more benefits,
also there already is a check for the 'io_op < IOOP_NUM_TYPES' in the
pgstat_count_io_op() function.Yeah, I think we can remove the "io_op < IOOP_NUM_TYPE" assertion in
pgstat_count_io_op() (but keep this check as part of the macro).Appart from the above, LGTM.
Okay, so applied.
And I've somewhat managed to fat-finger the business with
pgstat_count_io_op() with an incorrect rebase. Will remove in a
minute..
--
Michael
Hi,
On Tue, 14 Jan 2025 at 06:18, Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Jan 10, 2025 at 08:23:46AM +0000, Bertrand Drouvot wrote:
On Fri, Jan 10, 2025 at 10:15:52AM +0300, Nazir Bilal Yavuz wrote:
But I agree that having a macro has more benefits,
also there already is a check for the 'io_op < IOOP_NUM_TYPES' in the
pgstat_count_io_op() function.Yeah, I think we can remove the "io_op < IOOP_NUM_TYPE" assertion in
pgstat_count_io_op() (but keep this check as part of the macro).Appart from the above, LGTM.
Okay, so applied.
And I've somewhat managed to fat-finger the business with
pgstat_count_io_op() with an incorrect rebase. Will remove in a
minute..
Thank you!
--
Regards,
Nazir Bilal Yavuz
Microsoft