Remove wal_[sync|write][_time] from pg_stat_wal and track_wal_io_timing
Hi hackers,
a051e71e28a added the WAL writes and fsyncs information (and their related timing)
into the pg_stat_io view so that one can now find the same information in 2 places:
pg_stat_wal and pg_stat_io (for the wal object). This lead to a discussion in
[1]: /messages/by-id/Z7NSc5j6M4g2r1HD@ip-10-97-1-34.eu-west-3.compute.internal
=== Let's sump up in this dedicated thread what has been discussed in [1]/messages/by-id/Z7NSc5j6M4g2r1HD@ip-10-97-1-34.eu-west-3.compute.internal
1. The pg_stat_io's writes and fsyncs (and their relative timing) are incremented
at the exact same places as their pg_stat_wal counterparts.
2. pgstat_tracks_io_bktype() returns false for backend's that do not generate WAL
(so that we don't lost WAL information in pg_stat_io).
3. pg_stat_stat.wal_write is the same value as "select sum(writes)
from pg_stat_io where object = 'wal' and context = 'normal'" as these
are incremented in XLogWrite().
4. Same argument about pg_stat_wal.wal_write_time with
pg_stat_io.write_time.
5. issue_xlog_fsync() tells that pg_stat_wal.wal_sync_time and
sum(pg_stat_io.fsync_time) under object=wal and context=normal are the
same values.
6. Same argument with the fsync counters pg_stat_wal.wal_sync and
pg_stat_io.fsyncs.
7. This will encourage monitoring pull to move to pg_stat_io, where there is much
more context and granularity of the stats data.
8. That will simplify the work done for the per backend WAL statistics ([1]/messages/by-id/Z7NSc5j6M4g2r1HD@ip-10-97-1-34.eu-west-3.compute.internal)
=== Remarks
R1. The "bytes" fields differ (as pg_stat_wal.wal_bytes somehow "focus" on the wal
records size while the pg_stat_io's unit is the wal_block_size) so we keep them
in both places.
R2. track_wal_io_timing becomes useless once those fields are gone from pg_stat_wal
R3. PendingWalStats becomes empty, so removes it.
=== Patch structure
PFA 3 sub patches:
- 0001: Add details in the pg_stat_io doc about the wal object. That's basically
more or less a copy/paste from the pg_stat_wal fields description that will be
removed in 0002. As we already track them in pg_stat_io, it's done in a dedicated
patch.
- 0002: Remove wal_[sync|write][_time] from pg_stat_wal, PendingWalStats and
track_wal_io_timing. That does not make sense to split this work in sub-patches
so that all of this in done in 0002.
- 0003: remove the pgstat_prepare_io_time() parameter. Now that track_wal_io_timing
is gone there is no need for pgstat_prepare_io_time() to get a parameter.
[1]: /messages/by-id/Z7NSc5j6M4g2r1HD@ip-10-97-1-34.eu-west-3.compute.internal
Looking forward to your feedback,
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Add-details-in-the-pg_stat_io-doc-about-the-wal-o.patchtext/x-diff; charset=us-asciiDownload+42-1
v1-0002-Remove-wal_-sync-write-_time-from-pg_stat_wal-and.patchtext/x-diff; charset=us-asciiDownload+19-216
v1-0003-remove-the-pgstat_prepare_io_time-parameter.patchtext/x-diff; charset=us-asciiDownload+19-21
On Tue, Feb 18, 2025 at 10:43:15AM +0000, Bertrand Drouvot wrote:
=== Remarks
As mentioned on a previous thread, I am siding with the removal of
these fields. There are more benefits on terms of granularity of this
data to encourage users to move away from pg_stat_wal and stick to
pg_stat_io. Reasons are posted here:
/messages/by-id/Z7PHiAyt6GI0_Tlt@paquier.xyz
R1. The "bytes" fields differ (as pg_stat_wal.wal_bytes somehow "focus" on the wal
records size while the pg_stat_io's unit is the wal_block_size) so we keep them
in both places.
Yes, I don't think we should drop that anyway. There is no equivalent
mapping and pg_stat_wal.bytes is not I/O. This is also useful in the
WAL stats for the backends.
R2. track_wal_io_timing becomes useless once those fields are gone
from pg_stat_walR3. PendingWalStats becomes empty, so removes it.
Sounds fine to me. The simplifications are in pgstat_wal.c, where we
can now depend only on the diffs in the WalUsage, which we rely on for
EXPLAIN, VACUUM/ANALYZE and pgss. This makes the implementation of
the backend WAL stats much simpler.
=== Patch structure
PFA 3 sub patches:
- 0001: Add details in the pg_stat_io doc about the wal object. That's basically
more or less a copy/paste from the pg_stat_wal fields description that will be
removed in 0002. As we already track them in pg_stat_io, it's done in a dedicated
patch.
These additions feel unbalanced with the existing contents, and
overlap with the follow-up paragraph about the tuning that can be
guessed from the contents of pg_stat_io because the new content refers
twice to the section "WAL Configuration". Perhaps this could be
simpler, with one sentence in the tuning part? I would propose
something like:
For the WAL object, "fsyncs" and "sync_time" track the sync activity
of WAL files via issue_xlog_fsync(). "writes" and "write_time" track
the write activity of WAL files via XLogWrite(). See <xref
linkend="wal-configuration"/> for more information.
My point is that the "WAL configuration" section already provides all
the details that were in pg_stat_wal removed in 0002 and added in
0001, leading to duplicated contents.
- 0002: Remove wal_[sync|write][_time] from pg_stat_wal, PendingWalStats and
track_wal_io_timing. That does not make sense to split this work in sub-patches
so that all of this in done in 0002.
- 0003: remove the pgstat_prepare_io_time() parameter. Now that track_wal_io_timing
is gone there is no need for pgstat_prepare_io_time() to get a parameter.
0002 and 0003 can be grouped in the same commit, IMO.
- When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total
+ When <xref linkend="guc-track-io-timing"/> is enabled, the total
amounts of time <function>XLogWrite</function> writes and
<function>issue_xlog_fsync</function> syncs WAL data to disk are counted as
- <literal>wal_write_time</literal> and <literal>wal_sync_time</literal> in
- <xref linkend="pg-stat-wal-view"/>, respectively.
+ <literal>write_time</literal> and <literal>sync_time</literal> in
+ <xref linkend="pg-stat-io-view"/> for the wal <literal>object</literal>, respectively.
Okay to update these are part of 0002 that removes the fields, but
there is also an argument about updating that on its own because this
is actually the case on HEAD? (No need to post an updated patch for
this remark.)
The removal with PgStat_PendingWalStats and the GUC gone looks clean
as presented.
--
Michael
Hi,
On Wed, Feb 19, 2025 at 01:32:33PM +0900, Michael Paquier wrote:
These additions feel unbalanced with the existing contents, and
overlap with the follow-up paragraph about the tuning that can be
guessed from the contents of pg_stat_io because the new content refers
twice to the section "WAL Configuration". Perhaps this could be
simpler, with one sentence in the tuning part? I would propose
something like:
For the WAL object, "fsyncs" and "sync_time" track the sync activity
of WAL files via issue_xlog_fsync(). "writes" and "write_time" track
the write activity of WAL files via XLogWrite(). See <xref
linkend="wal-configuration"/> for more information.My point is that the "WAL configuration" section already provides all
the details that were in pg_stat_wal removed in 0002 and added in
0001, leading to duplicated contents.
Makes sense, done that way in the attached.
- 0002: Remove wal_[sync|write][_time] from pg_stat_wal, PendingWalStats and
track_wal_io_timing. That does not make sense to split this work in sub-patches
so that all of this in done in 0002.
- 0003: remove the pgstat_prepare_io_time() parameter. Now that track_wal_io_timing
is gone there is no need for pgstat_prepare_io_time() to get a parameter.0002 and 0003 can be grouped in the same commit, IMO.
Idea was to "ease" the review but I'm fine to merge them. Done that way in the
attached.
- When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total + When <xref linkend="guc-track-io-timing"/> is enabled, the total amounts of time <function>XLogWrite</function> writes and <function>issue_xlog_fsync</function> syncs WAL data to disk are counted as - <literal>wal_write_time</literal> and <literal>wal_sync_time</literal> in - <xref linkend="pg-stat-wal-view"/>, respectively. + <literal>write_time</literal> and <literal>sync_time</literal> in + <xref linkend="pg-stat-io-view"/> for the wal <literal>object</literal>, respectively.Okay to update these are part of 0002 that removes the fields, but
there is also an argument about updating that on its own because this
is actually the case on HEAD? (No need to post an updated patch for
this remark.)
That's right but that would mean almost duplicating the pg_stat_wal related
stuff (because it can't be removed from the doc until the fields are gone). I
think it's simpler to do it as proposed initially (the end result is the same).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 19, 2025 at 09:24:41AM +0000, Bertrand Drouvot wrote:
That's right but that would mean almost duplicating the pg_stat_wal related
stuff (because it can't be removed from the doc until the fields are gone). I
think it's simpler to do it as proposed initially (the end result is the same).
After sleeping on it, I still saw no reason to not apply the changes
from 0002 in wal.sgml to describe that the stats for the writes/fsyncs
are in pg_stat_io rather than pg_stat_wal for the "WAL configuration"
section, so done that. The tags were a bit inconsistent, and I've
managed to miss the.. cough.. s/sync_time/fsync_time/.
0001 was mixing a bit the tags <varname>, <literal> and <function>,
and I've moved that as a paragraph not under the tuning part, to make
it a more general statement with its link to "WAL configuration",
which is a very popular link for pg_stat_io.
Attached is the rest, as of v3-0002.
--
Michael
Attachments:
v3-0002-Remove-wal_-sync-write-_time-from-pg_stat_wal-and.patchtext/x-diff; charset=us-asciiDownload+30-228
Hi,
On Thu, Feb 20, 2025 at 02:37:18PM +0900, Michael Paquier wrote:
On Wed, Feb 19, 2025 at 09:24:41AM +0000, Bertrand Drouvot wrote:
That's right but that would mean almost duplicating the pg_stat_wal related
stuff (because it can't be removed from the doc until the fields are gone). I
think it's simpler to do it as proposed initially (the end result is the same).After sleeping on it, I still saw no reason to not apply the changes
from 0002 in wal.sgml to describe that the stats for the writes/fsyncs
are in pg_stat_io rather than pg_stat_wal for the "WAL configuration"
section, so done that.
I see, I did not get that you wanted to get rid of the pg_stat_wal part before
removing the fields. Makes sense that way to "just" replace the pg_stat_wal
by pg_stat_io first in the doc as you've done in 4538bd3f1dd.
and I've moved that as a paragraph not under the tuning part, to make
it a more general statement with its link to "WAL configuration",
which is a very popular link for pg_stat_io.
Makes more sense, agree.
Attached is the rest, as of v3-0002.
LGTM.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 20, 2025 at 08:11:12AM +0000, Bertrand Drouvot wrote:
LGTM.
Applied this one. Now, to the part about the backend stats and WAL,
which should be the last piece..
--
Michael
Hi,
On Mon, Feb 24, 2025 at 09:57:49AM +0900, Michael Paquier wrote:
On Thu, Feb 20, 2025 at 08:11:12AM +0000, Bertrand Drouvot wrote:
LGTM.
Applied this one.
Thanks!
Now, to the part about the backend stats and WAL,
which should be the last piece..
Yup, going back to [1]/messages/by-id/Z7UJoX8jr7aBbt2Q@paquier.xyz now.
[1]: /messages/by-id/Z7UJoX8jr7aBbt2Q@paquier.xyz
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2025-02-20 14:37:18 +0900, Michael Paquier wrote:
a051e71e28a added this information into pg_stat_io (with more details and
granularity), so there is no need to keep it in pg_stat_wal. This also
allows to remove PendingWalStats and simplifies up coming commits related
to per backend WAL statistics. Once done, track_wal_io_timing becomes useless
so it is also removed.In passing remove the pgstat_prepare_io_time() parameter now that
track_wal_io_timing is gone.
I don't think this is a good idea - there was a good reason for
track_wal_io_timing to exist, namely that it happens while holding one of the
two most contended locks in postgres. On many systems it'll be an ok constant
overhead to enable track_io_timing, but enabling track_wal_io_timing will
cause scalability issues. Now you made it impossible to separate those two
situations, forcing disabling of all IO timing.
Greetings,
Andres Freund
Hi,
On Mon, Feb 24, 2025 at 04:41:36AM -0500, Andres Freund wrote:
Hi,
On 2025-02-20 14:37:18 +0900, Michael Paquier wrote:
a051e71e28a added this information into pg_stat_io (with more details and
granularity), so there is no need to keep it in pg_stat_wal. This also
allows to remove PendingWalStats and simplifies up coming commits related
to per backend WAL statistics. Once done, track_wal_io_timing becomes useless
so it is also removed.In passing remove the pgstat_prepare_io_time() parameter now that
track_wal_io_timing is gone.I don't think this is a good idea - there was a good reason for
track_wal_io_timing to exist, namely that it happens while holding one of the
two most contended locks in postgres. On many systems it'll be an ok constant
overhead to enable track_io_timing, but enabling track_wal_io_timing will
cause scalability issues. Now you made it impossible to separate those two
situations, forcing disabling of all IO timing.
a051e71e28a added the "timing tracking" in the WAL code path with "only"
track_io_timing enabled (and track_wal_io_timing still false). Here, in this thread,
we remove unnecessary INSTR_TIME_SET_CURRENT()/INSTR_TIME_ACCUM_DIFF() calls when
both track_io_timing and track_wal_io_timing were enabled.
So I think that your main concern comes from the fact that a051e71e28a added the
"timing tracking" in the WAL code path with "only" track_io_timing enabled.
That's a concern we also had and discussed in [1]/messages/by-id/CAN55FZ1qOt_gjhQgJdQZjO1KebBLuZcHz4DYmjfUvF3yGBSahw@mail.gmail.com. The question was: should
we track the WAL timing stats in pg_stat_io only when track_wal_io_timing is
enabled or relying only on track_io_timing would be enough?
We ran several benchmarks ([2]/messages/by-id/Z6CcglxJF8XW+R7W@ip-10-97-1-34.eu-west-3.compute.internal, [3]/messages/by-id/CAN55FZ0thZHTBbXwAsNhfrRdgmDwxWbLPiZyh+TG9CrC1V8TeA@mail.gmail.com and [4]/messages/by-id/Z6HH150-Aw6ilQYE@paquier.xyz) and based on the results we reached
the conclusion that to rely only on track_io_timing to display the WAL timing
stats in pg_stat_io was "good" enough.
If you feel strongly that we should keep track_wal_io_timing then we would need
to change a bit the logic in a051e71e28a and then keep it in the current thread (
but still removing the "now useless" pg_stat_wal fields).
[1]: /messages/by-id/CAN55FZ1qOt_gjhQgJdQZjO1KebBLuZcHz4DYmjfUvF3yGBSahw@mail.gmail.com
[2]: /messages/by-id/Z6CcglxJF8XW+R7W@ip-10-97-1-34.eu-west-3.compute.internal
[3]: /messages/by-id/CAN55FZ0thZHTBbXwAsNhfrRdgmDwxWbLPiZyh+TG9CrC1V8TeA@mail.gmail.com
[4]: /messages/by-id/Z6HH150-Aw6ilQYE@paquier.xyz
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2025-02-24 10:54:54 +0000, Bertrand Drouvot wrote:
a051e71e28a added the "timing tracking" in the WAL code path with "only"
track_io_timing enabled (and track_wal_io_timing still false). Here, in this thread,
we remove unnecessary INSTR_TIME_SET_CURRENT()/INSTR_TIME_ACCUM_DIFF() calls when
both track_io_timing and track_wal_io_timing were enabled.So I think that your main concern comes from the fact that a051e71e28a added the
"timing tracking" in the WAL code path with "only" track_io_timing enabled.
Correct.
That's a concern we also had and discussed in [1]. The question was: should
we track the WAL timing stats in pg_stat_io only when track_wal_io_timing is
enabled or relying only on track_io_timing would be enough?We ran several benchmarks ([2], [3] and [4]) and based on the results we reached
the conclusion that to rely only on track_io_timing to display the WAL timing
stats in pg_stat_io was "good" enough.
Sorry to be blunt, but those benchmarks miss the mark. The benchmark
- emits WAL in a transactional fashion, with fairly small transactions, where
the disk is fairly slow. It's absolutely guaranteed that the bottleneck is
just the WAL flush disk time. You're doing ~1k-2k TPS - the timing overhead
would have to be *ginormous* to show up.
- emits WAL at a low concurrency thus lock contention not really an issue.
- likely is executed so on a system with very cheap timing functions - but
often production workloads unfortunately don't. While it's not quite as
common as it used to be, virtualized systems often have considerably slower
clocks.
On this system it'll likely be fine overhead-wise:
$ echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$ src/bin/pg_test_timing/pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 19.17 ns
Histogram of timing durations:
< us % of total count
1 98.08521 153484414
2 1.91421 2995376
4 0.00040 619
8 0.00014 225
16 0.00002 35
32 0.00001 9
64 0.00001 10
128 0.00000 3
256 0.00000 2
512 0.00000 1
On this system however, I'd not want to bet on it:
$ echo hpet | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
$ src/bin/pg_test_timing/pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 1228.91 ns
Histogram of timing durations:
< us % of total count
1 0.15574 3802
2 78.51065 1916598
4 21.26778 519188
8 0.02069 505
16 0.03957 966
32 0.00356 87
64 0.00193 47
128 0.00008 2
Greetings,
Andres Freund
Hi,
On Mon, Feb 24, 2025 at 06:15:53AM -0500, Andres Freund wrote:
Hi,
On 2025-02-24 10:54:54 +0000, Bertrand Drouvot wrote:
a051e71e28a added the "timing tracking" in the WAL code path with "only"
track_io_timing enabled (and track_wal_io_timing still false). Here, in this thread,
we remove unnecessary INSTR_TIME_SET_CURRENT()/INSTR_TIME_ACCUM_DIFF() calls when
both track_io_timing and track_wal_io_timing were enabled.So I think that your main concern comes from the fact that a051e71e28a added the
"timing tracking" in the WAL code path with "only" track_io_timing enabled.Correct.
That's a concern we also had and discussed in [1]. The question was: should
we track the WAL timing stats in pg_stat_io only when track_wal_io_timing is
enabled or relying only on track_io_timing would be enough?We ran several benchmarks ([2], [3] and [4]) and based on the results we reached
the conclusion that to rely only on track_io_timing to display the WAL timing
stats in pg_stat_io was "good" enough.Sorry to be blunt, but those benchmarks miss the mark. The benchmark
- emits WAL in a transactional fashion, with fairly small transactions, where
the disk is fairly slow. It's absolutely guaranteed that the bottleneck is
just the WAL flush disk time. You're doing ~1k-2k TPS - the timing overhead
would have to be *ginormous* to show up.- emits WAL at a low concurrency thus lock contention not really an issue.
- likely is executed so on a system with very cheap timing functions - but
often production workloads unfortunately don't. While it's not quite as
common as it used to be, virtualized systems often have considerably slower
clocks.On this system it'll likely be fine overhead-wise:
$ echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc$ src/bin/pg_test_timing/pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 19.17 ns
Histogram of timing durations:
< us % of total count
1 98.08521 153484414
2 1.91421 2995376
4 0.00040 619
8 0.00014 225
16 0.00002 35
32 0.00001 9
64 0.00001 10
128 0.00000 3
256 0.00000 2
512 0.00000 1On this system however, I'd not want to bet on it:
$ echo hpet | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
$ src/bin/pg_test_timing/pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 1228.91 ns
Histogram of timing durations:
< us % of total count
1 0.15574 3802
2 78.51065 1916598
4 21.26778 519188
8 0.02069 505
16 0.03957 966
32 0.00356 87
64 0.00193 47
128 0.00008 2
I agree that we've to put everything in the picture (system with or without
cheap timing functions, lock contention and WAL flush disk time) and that we
can certainly find configuration/workload that would get benefits from a
dedicated track_wal_io_timing GUC.
PFA a patch to re-introduce the track_wal_io_timing GUC and to ensure that the
WAL write and fsync timing activities are tracked in pg_stat_io (and
pg_stat_get_backend_io()) only if both track_io_timing and track_wal_io_timing
are enabled.
Note that to change the a051e71e28a behavior, the attached patch adds an extra
"track_io_guc" parameter to pgstat_count_io_op_time() (like pgstat_prepare_io_time
already had in a051e71e28a).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-re-introduce-track_wal_io_timing.patchtext/x-diff; charset=us-asciiDownload+87-39
On Mon, Feb 24, 2025 at 01:08:03PM +0000, Bertrand Drouvot wrote:
I agree that we've to put everything in the picture (system with or without
cheap timing functions, lock contention and WAL flush disk time) and that we
can certainly find configuration/workload that would get benefits from a
dedicated track_wal_io_timing GUC.PFA a patch to re-introduce the track_wal_io_timing GUC and to ensure that the
WAL write and fsync timing activities are tracked in pg_stat_io (and
pg_stat_get_backend_io()) only if both track_io_timing and track_wal_io_timing
are enabled.Note that to change the a051e71e28a behavior, the attached patch adds an extra
"track_io_guc" parameter to pgstat_count_io_op_time() (like pgstat_prepare_io_time
already had in a051e71e28a).
+ bool track_timing = track_io_timing && track_wal_io_timing;
- start = pgstat_prepare_io_time();
+ start = pgstat_prepare_io_time(track_timing);
This does not represent exactly what I am understanding from the
comment of upthread, because WAL IO timings would require both
track_wal_io_timing and track_io_timing to be enabled with this patch.
It seems to me that what we should do is to decouple completely the
timings for WAL and non-WAL IO across the two GUCs track_wal_io_timing
and track_io_timing, without any dependency between one and the other.
This way, it is possible to only enable one of them without affecting
the paths of the other, or both if you are ready to pay the price.
If you take that into account, XLogWrite(), XLogFileInitInternal(),
issue_xlog_fsync() should trigger clock calculations only for
track_wal_io_timing. The write and fsyncs are the hot ones in
standalone servers.
Two new things tracked in pg_stat_io are WALRead() and XLogPageRead(),
which are used at recovery, and for consistency with the rest there is
a good argument for controlling these as well with
track_wal_io_timing, I guess.
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
- instr_time start_time, uint32 cnt, uint64 bytes)
+ instr_time start_time, uint32 cnt, uint64 bytes,
+ bool track_io_guc)
Not much a fan of the new argument to pass the GUC value, which could
be error prone. It would be simpler to check that start_time is 0
instead. There is no need to change the other callers of
pgstat_count_io_op_time() if we do that.
pgstat_count_backend_io_op_time() would have triggered an assertion
failure, it needs to be adjusted.
With more tweaks applied to the docs, the attached is my result.
--
Michael
Attachments:
v2-0001-re-introduce-track_wal_io_timing.patchtext/x-diff; charset=us-asciiDownload+88-29
On Tue, Feb 25, 2025 at 02:20:48PM +0900, Michael Paquier wrote:
With more tweaks applied to the docs, the attached is my result.
Resending once as there was a blip with the prefixes used with
format-patch, preventing git am to work by default.
--
Michael
Attachments:
v2-0001-re-introduce-track_wal_io_timing.patchtext/x-diff; charset=us-asciiDownload+88-29
Hi,
On Tue, Feb 25, 2025 at 02:20:48PM +0900, Michael Paquier wrote:
On Mon, Feb 24, 2025 at 01:08:03PM +0000, Bertrand Drouvot wrote:
I agree that we've to put everything in the picture (system with or without
cheap timing functions, lock contention and WAL flush disk time) and that we
can certainly find configuration/workload that would get benefits from a
dedicated track_wal_io_timing GUC.PFA a patch to re-introduce the track_wal_io_timing GUC and to ensure that the
WAL write and fsync timing activities are tracked in pg_stat_io (and
pg_stat_get_backend_io()) only if both track_io_timing and track_wal_io_timing
are enabled.Note that to change the a051e71e28a behavior, the attached patch adds an extra
"track_io_guc" parameter to pgstat_count_io_op_time() (like pgstat_prepare_io_time
already had in a051e71e28a).+ bool track_timing = track_io_timing && track_wal_io_timing; - start = pgstat_prepare_io_time(); + start = pgstat_prepare_io_time(track_timing);This does not represent exactly what I am understanding from the
comment of upthread, because WAL IO timings would require both
track_wal_io_timing and track_io_timing to be enabled with this patch.
It seems to me that what we should do is to decouple completely the
timings for WAL and non-WAL IO across the two GUCs track_wal_io_timing
and track_io_timing, without any dependency between one and the other.
This way, it is possible to only enable one of them without affecting
the paths of the other, or both if you are ready to pay the price.
The idea was to not let track_io_timing alone enable the timing in the WAL
code path. My reasoning was: if you want to see timing in pg_stat_io then you
need to enable track_io_timing. But that's not enough if you also want to see
WAL timing, then you also need to set track_wal_io_timing. Your proposal also
ensures that "track_io_timing alone can not enable the timing in the WAL code path",
with a clear separation of duties, it's probably better so I'm fine with it.
Two new things tracked in pg_stat_io are WALRead() and XLogPageRead(),
which are used at recovery, and for consistency with the rest there is
a good argument for controlling these as well with
track_wal_io_timing, I guess.
Yeah, I though about it too but decided to not change the READ part in v1 (because
I think they are less of a concern). OTOH, if you want to see the READ timing then
you need to set track_wal_io_timing but then once the recovery is over then you'll
need to disable track_wal_io_timing (if you don't want to pay the price for
write/fsync activities).
OTOH pre a051e71e28a track_wal_io_timing was impacting only the write/fsyncs,
in that regard v1 was close to that.
I think both idea (v1 / v2) have pros and cons and I don't have a strong opinion
on it (though I do prefer v1 a bit for the reasons mentioned above).
void pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, - instr_time start_time, uint32 cnt, uint64 bytes) + instr_time start_time, uint32 cnt, uint64 bytes, + bool track_io_guc)Not much a fan of the new argument to pass the GUC value, which could
be error prone. It would be simpler to check that start_time is 0
instead. There is no need to change the other callers of
pgstat_count_io_op_time() if we do that.
Yeah I thought about it too
- if (track_io_guc)
+ if (!INSTR_TIME_IS_ZERO(start_time))
INSTR_TIME_IS_ZERO is cheap so it is fine by me.
pgstat_count_backend_io_op_time() would have triggered an assertion
failure, it needs to be adjusted.
With v2 in place, yeah.
With more tweaks applied to the docs, the attached is my result.
In the track_io_timing GUC description Shouldn't we also mention the
wal object restriction, something like?
"
in pg_stat_database, pg_stat_io (if object is not wal), in the output of the
pg_stat_get_backend_io() function (if object is not wal)
"
Also in the track_wal_io_timing GUC description add the wal object restriction
for the function:
"
and in the output of the pg_stat_get_backend_io() function for the object wal
"
The proposed doc changes are in the .txt attached (that applies on top of v2).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
doc_update.txttext/plain; charset=us-asciiDownload+6-3
On Tue, Feb 25, 2025 at 08:12:53AM +0000, Bertrand Drouvot wrote:
The idea was to not let track_io_timing alone enable the timing in the WAL
code path. My reasoning was: if you want to see timing in pg_stat_io then you
need to enable track_io_timing. But that's not enough if you also want to see
WAL timing, then you also need to set track_wal_io_timing. Your proposal also
ensures that "track_io_timing alone can not enable the timing in the WAL code path",
with a clear separation of duties, it's probably better so I'm fine with it.
One problem with this approach is that you would need to always pay
the price of the timings under track_io_timing if the WAL part is
enabled, so you'd lose flexibility compared to the past configuration.
Based on the complaint of upthread, a split makes things more
flexible, as it is possible to monitor one or the other or both. I'm
OK to tweak things more as required, we still have time for that.
Yeah, I though about it too but decided to not change the READ part in v1 (because
I think they are less of a concern). OTOH, if you want to see the READ timing then
you need to set track_wal_io_timing but then once the recovery is over then you'll
need to disable track_wal_io_timing (if you don't want to pay the price for
write/fsync activities).OTOH pre a051e71e28a track_wal_io_timing was impacting only the write/fsyncs,
in that regard v1 was close to that.
Still less consistent.. We've never tracked WAL reads in the stats up
to now, so we need to put it in one of these buckets.
In the track_io_timing GUC description Shouldn't we also mention the
wal object restriction, something like?"
in pg_stat_database, pg_stat_io (if object is not wal), in the output of the
pg_stat_get_backend_io() function (if object is not wal)
"The proposed doc changes are in the .txt attached (that applies on top of v2).
Sounds like a good set of additions as we have the two GUCs.
Bonus idea: We could also have more GUCs to control all that, or just
put eveything in one single GUC that takes a list of values made of
pairs of (object,op), then set the timings only for the combinations
listed in the GUC. That feels a bit over-engineered, but you could
deeply control the areas where the data is aggregated. I'm not really
convinced it is strictly necessary, but, well, that would not be the
first time people tell me I'm wrong.. All things said, v2 sounds like
it has the right balance for now based on what I am reading, providing
the same control as pg_stat_wal previously, so I've used it for now.
--
Michael