log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
Hi,
I've been repeatedly confused by the number of WAL files supposedly added. Even when hundreds of new WAL files are created, the relevant portion of the log_checkpoints output will only ever list zero or one added WAL file.
The reason for that is that CheckpointStats.ckpt_segs_added is only incremented in PreallocXlogFiles(), which has the following comment:
* XXX this is currently extremely conservative, since it forces only one
* future log segment to exist, and even that only if we are 75% done with
* the current one. This is only appropriate for very low-WAL-volume systems.
In real workloads, by contrast, WAL files are almost exclusively created via XLogWrite()->XLogFileInit().
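To make that imbalance concrete, here's a heavily simplified sketch of the two code paths (paraphrased, not the actual xlog.c source; the signatures are trimmed and helpers like segment_is_mostly_full() and next_segno are made up for illustration):

/* Path 1: the only place that bumps the "N WAL file(s) added" counter. */
static void
PreallocXlogFiles(XLogRecPtr endptr)
{
    bool        use_existent = true;

    /* at most one future segment, and only when ~75% through the current one */
    if (segment_is_mostly_full(endptr))
    {
        XLogFileInit(next_segno, &use_existent);
        if (!use_existent)
            CheckpointStats.ckpt_segs_added++;  /* what log_checkpoints shows */
    }
}

/* Path 2: where nearly all WAL files are actually created under load. */
static void
XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
    bool        use_existent = true;

    /* ... when the write position crosses into a new segment ... */
    openLogFile = XLogFileInit(next_segno, &use_existent);
    /* creates the file via BasicOpenFile() with O_CREAT if no old segment
     * can be recycled, but never touches CheckpointStats.ckpt_segs_added */
}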
I think we should consider just removing that field. Or, even better, show
something accurate instead.
As an example, here's the log output of a workload that has a replication slot
preventing WAL files from being recycled (and too small max_wal_size):
2021-07-24 15:47:42.524 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55767 buffers (42.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.914 s, sync=0.041 s, total=3.972 s; sync files=10, longest=0.010 s, average=0.005 s; distance=540578 kB, estimate=540905 kB
2021-07-24 15:47:46.721 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55806 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.855 s, sync=0.028 s, total=3.928 s; sync files=8, longest=0.008 s, average=0.004 s; distance=540708 kB, estimate=540886 kB
2021-07-24 15:47:51.004 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55850 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.895 s, sync=0.034 s, total=3.974 s; sync files=9, longest=0.009 s, average=0.004 s; distance=540657 kB, estimate=540863 kB
2021-07-24 15:47:55.231 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55879 buffers (42.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.878 s, sync=0.026 s, total=3.944 s; sync files=9, longest=0.007 s, average=0.003 s; distance=540733 kB, estimate=540850 kB
2021-07-24 15:47:59.462 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55847 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.882 s, sync=0.027 s, total=3.952 s; sync files=9, longest=0.008 s, average=0.003 s; distance=540640 kB, estimate=540829 kB
So it's 3 new WAL files in that timeframe, one might think?
A probe instrumenting xlog file creation shows something very different:
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -a XLogFileInitInternal:39
(39 is the O_CREAT BasicOpenFile(), not the recycle path).
perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000
1.001030943 9 probe_postgres:XLogFileInitInternal_L39
2.002998896 8 probe_postgres:XLogFileInitInternal_L39
3.005098857 8 probe_postgres:XLogFileInitInternal_L39
4.007000426 6 probe_postgres:XLogFileInitInternal_L39
5.008423119 9 probe_postgres:XLogFileInitInternal_L39
6.013427568 8 probe_postgres:XLogFileInitInternal_L39
7.015087613 8 probe_postgres:XLogFileInitInternal_L39
8.017393339 8 probe_postgres:XLogFileInitInternal_L39
9.018425526 7 probe_postgres:XLogFileInitInternal_L39
10.020398520 5 probe_postgres:XLogFileInitInternal_L39
Greetings,
Andres Freund
Hi,
On 2021-07-24 15:50:36 -0700, Andres Freund wrote:
As an example, here's the log output of a workload that has a replication slot
preventing WAL files from being recycled (and too small max_wal_size):
2021-07-24 15:47:42.524 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55767 buffers (42.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.914 s, sync=0.041 s, total=3.972 s; sync files=10, longest=0.010 s, average=0.005 s; distance=540578 kB, estimate=540905 kB
2021-07-24 15:47:46.721 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55806 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.855 s, sync=0.028 s, total=3.928 s; sync files=8, longest=0.008 s, average=0.004 s; distance=540708 kB, estimate=540886 kB
2021-07-24 15:47:51.004 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55850 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.895 s, sync=0.034 s, total=3.974 s; sync files=9, longest=0.009 s, average=0.004 s; distance=540657 kB, estimate=540863 kB
2021-07-24 15:47:55.231 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55879 buffers (42.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.878 s, sync=0.026 s, total=3.944 s; sync files=9, longest=0.007 s, average=0.003 s; distance=540733 kB, estimate=540850 kB
2021-07-24 15:47:59.462 PDT [2251649][checkpointer][:0] LOG: checkpoint complete: wrote 55847 buffers (42.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=3.882 s, sync=0.027 s, total=3.952 s; sync files=9, longest=0.008 s, average=0.003 s; distance=540640 kB, estimate=540829 kB
So it's 3 new WAL files in that timeframe, one might think?
A probe instrumenting xlog file creation shows something very different:
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -a XLogFileInitInternal:39
(39 is the O_CREAT BasicOpenFile(), not the recycle path).
perf stat -a -e probe_postgres:XLogFileInitInternal_L39 -I 1000
1.001030943 9 probe_postgres:XLogFileInitInternal_L39
2.002998896 8 probe_postgres:XLogFileInitInternal_L39
3.005098857 8 probe_postgres:XLogFileInitInternal_L39
4.007000426 6 probe_postgres:XLogFileInitInternal_L39
5.008423119 9 probe_postgres:XLogFileInitInternal_L39
6.013427568 8 probe_postgres:XLogFileInitInternal_L39
7.015087613 8 probe_postgres:XLogFileInitInternal_L39
8.017393339 8 probe_postgres:XLogFileInitInternal_L39
9.018425526 7 probe_postgres:XLogFileInitInternal_L39
10.020398520 5 probe_postgres:XLogFileInitInternal_L39
And even more extreme, the logs can end up suggesting pg_wal is shrinking,
when the opposite is the case. Here's the checkpoint output from a parallel
copy data load (without a replication slot holding things back):
2021-07-24 15:59:03.215 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 22291 buffers (17.0%); 0 WAL file(s) added, 27 removed, 141 recycled; write=9.737 s, sync=4.153 s, total=14.884 s; sync files=108, longest=0.116 s, average=0.039 s; distance=2756904 kB, estimate=2756904 kB
2021-07-24 15:59:12.978 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 21840 buffers (16.7%); 0 WAL file(s) added, 53 removed, 149 recycled; write=5.531 s, sync=3.008 s, total=9.763 s; sync files=81, longest=0.201 s, average=0.037 s; distance=3313885 kB, estimate=3313885 kB
2021-07-24 15:59:23.421 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 22326 buffers (17.0%); 0 WAL file(s) added, 56 removed, 149 recycled; write=5.787 s, sync=3.230 s, total=10.436 s; sync files=81, longest=0.099 s, average=0.040 s; distance=3346125 kB, estimate=3346125 kB
2021-07-24 15:59:34.424 PDT [2253324][checkpointer][:0] LOG: checkpoint complete: wrote 22155 buffers (16.9%); 0 WAL file(s) added, 60 removed, 148 recycled; write=6.096 s, sync=3.432 s, total=10.995 s; sync files=81, longest=0.101 s, average=0.043 s; distance=3409281 kB, estimate=3409281 kB
It does look like WAL space usage is shrinking, but the opposite is true -
we're creating so much WAL that the checkpointer can't checkpoint fast enough
to keep WAL usage below max_wal_size. So WAL files are constantly created that
then need to be removed (hence the non-zero removed counts).
# time counts unit events
277.087990275 15 probe_postgres:XLogFileInitInternal_L39
278.098549960 15 probe_postgres:XLogFileInitInternal_L39
279.104787575 6 probe_postgres:XLogFileInitInternal_L39
280.108980690 5 probe_postgres:XLogFileInitInternal_L39
281.111781617 6 probe_postgres:XLogFileInitInternal_L39
282.113601958 2 probe_postgres:XLogFileInitInternal_L39
283.115711683 0 probe_postgres:XLogFileInitInternal_L39
284.121508636 0 probe_postgres:XLogFileInitInternal_L39
285.124865325 0 probe_postgres:XLogFileInitInternal_L39
286.126932016 0 probe_postgres:XLogFileInitInternal_L39
287.129874993 11 probe_postgres:XLogFileInitInternal_L39
288.131838429 15 probe_postgres:XLogFileInitInternal_L39
289.133609021 13 probe_postgres:XLogFileInitInternal_L39
290.136254341 6 probe_postgres:XLogFileInitInternal_L39
291.139368485 5 probe_postgres:XLogFileInitInternal_L39
292.142728293 6 probe_postgres:XLogFileInitInternal_L39
293.148078766 2 probe_postgres:XLogFileInitInternal_L39
294.150258476 0 probe_postgres:XLogFileInitInternal_L39
295.172398897 0 probe_postgres:XLogFileInitInternal_L39
296.174658196 0 probe_postgres:XLogFileInitInternal_L39
297.176818943 0 probe_postgres:XLogFileInitInternal_L39
298.179003473 14 probe_postgres:XLogFileInitInternal_L39
299.181597777 14 probe_postgres:XLogFileInitInternal_L39
300.184711566 14 probe_postgres:XLogFileInitInternal_L39
301.188919194 6 probe_postgres:XLogFileInitInternal_L39
Greetings,
Andres Freund
On 2021/07/25 7:50, Andres Freund wrote:
Hi,
I've been repeatedly confused by the number of WAL files supposedly added. Even when hundreds of new WAL files are created, the relevant portion of the log_checkpoints output will only ever list zero or one added WAL file.
The reason for that is that CheckpointStats.ckpt_segs_added is only incremented in PreallocXlogFiles(), which has the following comment:
* XXX this is currently extremely conservative, since it forces only one
* future log segment to exist, and even that only if we are 75% done with
* the current one. This is only appropriate for very low-WAL-volume systems.
In real workloads, by contrast, WAL files are almost exclusively created via XLogWrite()->XLogFileInit().
I think we should consider just removing that field. Or, even better, show
something accurate instead.
+1 to show something accurate instead.
Isn't it also worth showing them in a monitoring stats view like pg_stat_wal?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 7/24/21, 8:10 PM, "Fujii Masao" <masao.fujii@oss.nttdata.com> wrote:
On 2021/07/25 7:50, Andres Freund wrote:
Hi,
I've been repeatedly confused by the number of WAL files supposedly added. Even when hundreds of new WAL files are created, the relevant portion of the log_checkpoints output will only ever list zero or one added WAL file.
The reason for that is that CheckpointStats.ckpt_segs_added is only incremented in PreallocXlogFiles(), which has the following comment:
* XXX this is currently extremely conservative, since it forces only one
* future log segment to exist, and even that only if we are 75% done with
* the current one. This is only appropriate for very low-WAL-volume systems.
In real workloads, by contrast, WAL files are almost exclusively created via XLogWrite()->XLogFileInit().
I think we should consider just removing that field. Or, even better, show something accurate instead.
+1 to show something accurate instead.
Isn't it also worth showing them in a monitoring stats view like pg_stat_wal?
+1. I was confused by this when working on a WAL pre-allocation
patch [0]. Perhaps it could be replaced by a new parameter and a new
field in pg_stat_wal. How about something like log_wal_init_interval,
where the value is the minimum amount of time between reporting the
number of WAL segments created since the last report?
Nathan
[0]: /messages/by-id/20201225200953.jjkrytlrzojbndh5@alap3.anarazel.de
On 2021/07/27 5:27, Bossart, Nathan wrote:
+1. I was confused by this when working on a WAL pre-allocation
patch [0]. Perhaps it could be replaced by a new parameter and a new
field in pg_stat_wal. How about something like log_wal_init_interval,
where the value is the minimum amount of time between reporting the
number of WAL segments created since the last report?
You mean to introduce a new GUC like log_wal_init_interval so that the number of WAL files created since the last report is logged at that interval? But isn't it better and simpler to just expose the accumulated number of WAL files created, in the pg_stat_wal view or elsewhere? Then we could easily get the number of WAL files created in any interval by checking the view and calculating the diff.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 7/26/21, 5:23 PM, "Fujii Masao" <masao.fujii@oss.nttdata.com> wrote:
On 2021/07/27 5:27, Bossart, Nathan wrote:
+1. I was confused by this when working on a WAL pre-allocation
patch [0]. Perhaps it could be replaced by a new parameter and a new
field in pg_stat_wal. How about something like log_wal_init_interval,
where the value is the minimum amount of time between reporting the
number of WAL segments created since the last report?
You mean to introduce a new GUC like log_wal_init_interval so that the number of WAL files created since the last report is logged at that interval? But isn't it better and simpler to just expose the accumulated number of WAL files created, in the pg_stat_wal view or elsewhere? Then we could easily get the number of WAL files created in any interval by checking the view and calculating the diff.
I agree with you about adding a new field to pg_stat_wal. The
parameter would just be a convenient way of logging this information
for future reference. I don't feel strongly about the parameter if
you think the pg_stat_wal addition is enough.
Nathan
Hi,
On 2021-07-26 20:27:21 +0000, Bossart, Nathan wrote:
+1. I was confused by this when working on a WAL pre-allocation
patch [0]. Perhaps it could be replaced by a new parameter and a new
field in pg_stat_wal. How about something like log_wal_init_interval,
where the value is the minimum amount of time between reporting the
number of WAL segments created since the last report?
Why not just make the number in log_checkpoints accurate? There's no
point in the current number, so we don't need to preserve it...
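For illustration only, one possible shape of that (a rough sketch, not a proposed patch; the shared counter and the variable names here are hypothetical):

/* Hypothetical shared counter, bumped by whichever backend physically
 * creates a WAL segment (the O_CREAT path discussed above). */
pg_atomic_uint64 *wal_segs_created;     /* lives in shared memory */

/* in the segment-creation path, right after the new file is created: */
pg_atomic_fetch_add_u64(wal_segs_created, 1);

/* in the checkpointer, when building the "checkpoint complete" line: */
uint64      created_now = pg_atomic_read_u64(wal_segs_created);

CheckpointStats.ckpt_segs_added = created_now - segs_created_at_ckpt_start;
segs_created_at_ckpt_start = created_now;

That way the "added" count would cover every file created since the previous checkpoint, regardless of which backend created it.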
Greetings,
Andres Freund
Hi,
On 2021-07-25 12:10:07 +0900, Fujii Masao wrote:
Isn't it also worth showing them in a monitoring stats view like pg_stat_wal?
I'm not convinced that's all that meaningful. It makes sense to include
it as part of the checkpoint output, because checkpoints determine when
WAL can be recycled etc. It's not that clear to me how to represent that
as part of pg_stat_wal?
I guess we could add columns for the amount of WAL that has been a) newly created, b) recycled, c) removed. In combination that *does* seem useful. But that's also a mostly independent change...
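As a rough illustration of what such columns could be backed by (hypothetical field names, not the existing struct definition):

/* Hypothetical additions to the counters behind pg_stat_wal --
 * illustrating the idea only, these fields don't exist. */
typedef struct PgStat_WalStats
{
    /* ... existing counters such as wal_records, wal_bytes ... */
    PgStat_Counter wal_segs_created;    /* a) newly created segment files */
    PgStat_Counter wal_segs_recycled;   /* b) old segments renamed into place */
    PgStat_Counter wal_segs_removed;    /* c) segments unlinked */
} PgStat_WalStats;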
Greetings,
Andres Freund
On 7/26/21, 5:48 PM, "Andres Freund" <andres@anarazel.de> wrote:
On 2021-07-26 20:27:21 +0000, Bossart, Nathan wrote:
+1. I was confused by this when working on a WAL pre-allocation
patch [0]. Perhaps it could be replaced by a new parameter and a new
field in pg_stat_wal. How about something like log_wal_init_interval,
where the value is the minimum amount of time between reporting the
number of WAL segments created since the last report?
Why not just make the number in log_checkpoints accurate? There's no
point in the current number, so we don't need to preserve it...
My understanding is that the statistics logged for log_checkpoints are
just for that specific checkpoint. From that angle, the value for the
number of WAL files added is technically correct. Checkpoints will
only ever create zero or one new file via PreallocXlogFiles(). If we
also added all the segments created outside of the checkpoint, the
value for "added" would go from meaning "WAL files created by this
checkpoint" to "WAL files created since the last checkpoint." That's
probably less confusing than what's there today, but it's still
slightly different from all the other log_checkpoints stats.
Nathan