bgwriter process is too lazy
Hi hackers,
Whenever I check the checkpoint information in the log, most dirty pages
are written by the checkpoint process. I read the source code to find out
why, and I think reducing this value would make the bgwriter process write
more dirty pages:
min_scan_buffers = (int) (NBuffers / (scan_whole_pool_milliseconds / BgWriterDelay));
Here's how I tested it. Originally I wanted to write a patch, but I'd like
to hear from the community experts first.
The relevant code is in src/backend/postmaster/bgwriter.c, and I also see a
comment about this in src/backend/storage/buffer/bufmgr.c.
This is a simple test I ran myself:
```
test methods
pgbench -i -s 10000 dbname -U postgres
pgbench -U postgres dbname -c 96 -j 96 -P 1 -T 900
test env
Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz 128 threads 64 cores
512G
ssd:1.8 T
postgresql version 17 GA
postgresql parameters
cat postgresql.auto.conf
listen_addresses = '*'
port = '5430'
max_connections = '1000'
unix_socket_directories = '/tmp,.'
unix_socket_permissions = '0700'
password_encryption = 'md5'
shared_buffers = '32GB'
maintenance_work_mem = '2GB'
autovacuum_work_mem = '2GB'
vacuum_buffer_usage_limit = '16GB'
max_files_per_process = '60000'
vacuum_cost_limit = '10000'
bgwriter_delay = '10ms'
bgwriter_lru_maxpages = '8192'
bgwriter_lru_multiplier = '10.0'
max_worker_processes = '32'
max_parallel_workers_per_gather = '8'
max_parallel_maintenance_workers = '8'
max_parallel_workers = '32'
wal_buffers = '64MB'
checkpoint_completion_target = '0.999'
max_wal_size = '32GB'
min_wal_size = '16GB'
archive_mode = 'on'
archive_command = '/bin/date'
wal_keep_size = '32GB'
effective_cache_size = '512GB'
log_destination = 'csvlog'
logging_collector = 'on'
log_filename = 'postgresql-%a_%H.log'
log_truncate_on_rotation = 'on'
log_lock_waits = 'on'
autovacuum_max_workers = '10'
autovacuum_naptime = '1min'
autovacuum_vacuum_scale_factor = '0.02'
autovacuum_vacuum_insert_scale_factor = '0.02'
autovacuum_analyze_scale_factor = '0.01'
checkpoint_timeout = '600'
test report
scan_whole_pool_milliseconds | buffers_clean | num_requested | write_time | sync_time | buffers_written | tps
original value               |       2731311 |            35 |    1110084 |      1285 |        49208974 | 33153.438460
60000.0                      |       7247872 |            27 |     855959 |      1110 |        42617458 | 41242.271814
30000.0                      |      13271584 |            28 |    1176642 |       746 |        35153293 | 34969.554408
10000.0                      |      17406338 |            24 |     870992 |       419 |        31586820 | 37177.879583
```
Thanks
On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
Whenever I check the checkpoint information in a log, most dirty pages are written by the checkpoint process
That's exactly how it should be!
Yours,
Laurenz Albe
On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:
On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
Whenever I check the checkpoint information in a log, most dirty pages
are written by the checkpoint process
That's exactly how it should be!
is it because if bgwriter frequently flushes, the disk io will be more?🤔
On 10/2/24 17:02, Tony Wayne wrote:
On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
Whenever I check the checkpoint information in a log, most dirty pages are written by the checkpoint process
That's exactly how it should be!
is it because if bgwriter frequently flushes, the disk io will be more?🤔
Yes, pretty much. But it's also about where the writes happen.
Checkpoint flushes dirty buffers only once per checkpoint interval,
which is the lowest amount of write I/O that needs to happen.
Every other way of flushing buffers is less efficient, and is mostly a
sign of memory pressure (shared buffers not large enough for active part
of the data).
But it also matters where the writes happen. Checkpoint does
that in the background, not as part of regular query execution. What we
don't want is for the user backends to flush buffers, because it's
expensive and can result in much higher latency.
The bgwriter is somewhere in between - it happens in the background,
but may not be as efficient as doing it in the checkpointer. Still much
better than having to do this in regular backends.
regards
--
Tomas Vondra
Hi Tomas,
Thank you for the explanation. If we don't change this static parameter,
could we, following the example of autovacuum_vacuum_cost_delay, lower the
minimum value of the bgwriter_delay parameter to 2ms? That would let the
bgwriter write more dirty pages and reduce the impact on performance when a
checkpoint occurs. After all, extending the checkpoint interval and the
crash-recovery time requires finding a balance; checkpoint_timeout and
max_wal_size cannot be increased indefinitely.
Thanks
Tomas Vondra <tomas@vondra.me> wrote on Thu, Oct 3, 2024 at 00:36:
Hi,
On 2024-10-02 18:36:44 +0200, Tomas Vondra wrote:
On 10/2/24 17:02, Tony Wayne wrote:
On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at
<mailto:laurenz.albe@cybertec.at>> wrote:On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
Whenever I check the checkpoint information in a log, most dirty
pages are written by the checkpoint process
That's exactly how it should be!
is it because if bgwriter frequently flushes, the disk io will be more?🤔
Yes, pretty much. But it's also about where the writes happen.
Checkpoint flushes dirty buffers only once per checkpoint interval,
which is the lowest amount of write I/O that needs to happen.
Every other way of flushing buffers is less efficient, and is mostly a
sign of memory pressure (shared buffers not large enough for active part
of the data).
It's implied, but to make it more explicit: One big efficiency advantage of
writes by checkpointer is that they are sorted and can often be combined into
larger writes. That's often a lot more efficient: For network attached storage
it saves you iops, for local SSDs it's much friendlier to wear leveling.
But it also matters where the writes happen. Checkpoint does
that in the background, not as part of regular query execution. What we
don't want is for the user backends to flush buffers, because it's
expensive and can result in much higher latency.
The bgwriter is somewhere in between - it happens in the background,
but may not be as efficient as doing it in the checkpointer. Still much
better than having to do this in regular backends.
Another aspect is that checkpointer's writes are much easier to pace over time
than e.g. bgwriter's, because bgwriter is triggered by a fairly short-term
signal. Eventually we'll want to combine writes by bgwriter too, but that's
always going to be more expensive than doing it in a large batched fashion
like checkpointer does.
I think we could improve checkpointer's pacing further, fwiw, by taking into
account that the WAL volume at the start of a spread-out checkpoint typically
is bigger than at the end.
Greetings,
Andres Freund
Hi Andres,
It's implied, but to make it more explicit: One big efficiency advantage of
writes by checkpointer is that they are sorted and can often be combined into
larger writes. That's often a lot more efficient: For network attached storage
it saves you iops, for local SSDs it's much friendlier to wear leveling.
Thank you for the explanation. I think the bgwriter can also merge I/O: it
writes asynchronously to the file system cache, and the OS schedules the
actual writes.
Another aspect is that checkpointer's writes are much easier to pace over time
than e.g. bgwriter's, because bgwriter is triggered by a fairly short-term
signal. Eventually we'll want to combine writes by bgwriter too, but that's
always going to be more expensive than doing it in a large batched fashion
like checkpointer does.
I think we could improve checkpointer's pacing further, fwiw, by taking into
account that the WAL volume at the start of a spread-out checkpoint typically
is bigger than at the end.
I'm also very keen to improve checkpoints. Whenever I run a stress test, the
bgwriter writes no dirty pages when the data set is smaller than
shared_buffers. Before the first checkpoint, the tps is stable and at its
highest for the entire run. Other databases flush dirty pages at a certain
frequency, at intervals, and at dirty-page watermarks, and they see a much
smaller performance impact when checkpoints occur.
Thanks
Andres Freund <andres@anarazel.de> wrote on Fri, Oct 4, 2024 at 03:40:
Hi,
On 2024-10-04 09:31:45 +0800, wenhui qiu wrote:
It's implied, but to make it more explicit: One big efficiency advantage of
writes by checkpointer is that they are sorted and can often be combined into
larger writes. That's often a lot more efficient: For network attached storage
it saves you iops, for local SSDs it's much friendlier to wear leveling.
Thank you for the explanation. I think the bgwriter can also merge I/O: it
writes asynchronously to the file system cache, and the OS schedules the
actual writes.
Because bgwriter writes are just ordered by their buffer id (further made less
sequential due to only writing out not-recently-used buffers), they are often
effectively random. The OS can't do much about that.
Another aspect is that checkpointer's writes are much easier to pace over time
than e.g. bgwriter's, because bgwriter is triggered by a fairly short-term
signal. Eventually we'll want to combine writes by bgwriter too, but that's
always going to be more expensive than doing it in a large batched fashion
like checkpointer does.
I think we could improve checkpointer's pacing further, fwiw, by taking into
account that the WAL volume at the start of a spread-out checkpoint typically
is bigger than at the end.
I'm also very keen to improve checkpoints. Whenever I run a stress test, the
bgwriter writes no dirty pages when the data set is smaller than
shared_buffers.
It *SHOULD NOT* do anything in that situation. There's absolutely nothing to
be gained by bgwriter writing in that case.
Before the first checkpoint, the tps is stable and at its highest for the
entire run. Other databases flush dirty pages at a certain frequency, at
intervals, and at dirty-page watermarks, and they see a much smaller
performance impact when checkpoints occur.
I doubt that slowdown is caused by bgwriter not being active enough. I suspect
what you're seeing is one or more of:
a) The overhead of doing full page writes (due to increasing the WAL
volume). You could verify whether that's the case by turning
full_page_writes off (but note that that's not generally safe!) or see if
the overhead shrinks if you set wal_compression=zstd or wal_compression=lz4
(don't use pglz, it's too slow).
b) The overhead of renaming WAL segments during recycling. You could see if
this is related by specifying --wal-segsize 512 or such during initdb.
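For reference, the two suggested experiments would look roughly like this (note: full_page_writes=off is for diagnosis only and is unsafe on storage without atomic 8kB writes; initdb's --wal-segsize takes megabytes):

```
# postgresql.conf: check whether full-page-write volume is the bottleneck
wal_compression = 'lz4'          # or 'zstd'; avoid pglz (too slow)
#full_page_writes = off          # diagnosis only -- not generally safe!

# at initdb time: larger WAL segments reduce recycling/rename overhead
# initdb --wal-segsize=512 -D $PGDATA
```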
Greetings,
Andres
Hi Andres,
Thank you for the explanation.
I doubt that slowdown is caused by bgwriter not being active enough. I
suspect what you're seeing is one or more of:
a) The overhead of doing full page writes (due to increasing the WAL
volume). You could verify whether that's the case by turning
full_page_writes off (but note that that's not generally safe!) or see if
the overhead shrinks if you set wal_compression=zstd or wal_compression=lz4
(don't use pglz, it's too slow).
b) The overhead of renaming WAL segments during recycling. You could see if
this is related by specifying --wal-segsize 512 or such during initdb.
I am aware of these optimizations, but they only mitigate the impact. I
deliberately did not turn on WAL compression during the stress test.
shared_buffers = '32GB'
bgwriter_delay = '10ms'
bgwriter_lru_maxpages = '8192'
bgwriter_lru_multiplier = '10.0'
wal_buffers = '64MB'
checkpoint_completion_target = '0.999'
checkpoint_timeout = '600'
max_wal_size = '32GB'
min_wal_size = '16GB'
I think that in workloads with many reads and few writes, it is indeed
desirable to keep as many dirty pages in memory as possible. However, in
workloads such as push systems and task-scheduling systems, which do a lot
of both reads and writes, the impact of checkpoints is more pronounced. An
adaptive bgwriter, or one triggered when dirty pages reach a certain
watermark, could eliminate the performance jitter caused by checkpoints.
From what I understand, quite a few commercial databases based on
PostgreSQL have added an adaptive dirty-page-flush feature, and according
to their internal reports the whole stress-testing process was very smooth.
Since it's a trade secret, I don't know how they implemented this feature.