bgwriter process is too lazy
Hi hackers,
Whenever I check the checkpoint information in the log, most dirty pages
are written by the checkpoint process. I read the source code to find out
why, and I think reducing this value would make the bgwriter process write
more dirty pages:
min_scan_buffers = (int) (NBuffers / (scan_whole_pool_milliseconds / BgWriterDelay));
Here's how I tested it. Originally I wanted to write a patch, but I'd like
to hear from the community experts first.
The relevant code is in src/backend/postmaster/bgwriter.c, and I also see a
comment about this in src/backend/storage/buffer/bufmgr.c.
This is a simple test I ran myself:
```
test methods
pgbench -i -s 10000 dbname -U postgres
pgbench -U postgres dbname -c 96 -j 96 -P 1 -T 900
test env
Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz 128 threads 64 cores
512G
ssd:1.8 T
postgresql version 17 GA
postgresql parameters
cat postgresql.auto.conf
listen_addresses = '*'
port = '5430'
max_connections = '1000'
unix_socket_directories = '/tmp,.'
unix_socket_permissions = '0700'
password_encryption = 'md5'
shared_buffers = '32GB'
maintenance_work_mem = '2GB'
autovacuum_work_mem = '2GB'
vacuum_buffer_usage_limit = '16GB'
max_files_per_process = '60000'
vacuum_cost_limit = '10000'
bgwriter_delay = '10ms'
bgwriter_lru_maxpages = '8192'
bgwriter_lru_multiplier = '10.0'
max_worker_processes = '32'
max_parallel_workers_per_gather = '8'
max_parallel_maintenance_workers = '8'
max_parallel_workers = '32'
wal_buffers = '64MB'
checkpoint_completion_target = '0.999'
max_wal_size = '32GB'
min_wal_size = '16GB'
archive_mode = 'on'
archive_command = '/bin/date'
wal_keep_size = '32GB'
effective_cache_size = '512GB'
log_destination = 'csvlog'
logging_collector = 'on'
log_filename = 'postgresql-%a_%H.log'
log_truncate_on_rotation = 'on'
log_lock_waits = 'on'
autovacuum_max_workers = '10'
autovacuum_naptime = '1min'
autovacuum_vacuum_scale_factor = '0.02'
autovacuum_vacuum_insert_scale_factor = '0.02'
autovacuum_analyze_scale_factor = '0.01'
checkpoint_timeout = '600'
test report
scan_whole_pool_milliseconds | buffers_clean | num_requested | write_time | sync_time | buffers_written | tps
original value               |       2731311 |            35 |    1110084 |      1285 |        49208974 | 33153.438460
60000.0                      |       7247872 |            27 |     855959 |      1110 |        42617458 | 41242.271814
30000.0                      |      13271584 |            28 |    1176642 |       746 |        35153293 | 34969.554408
10000.0                      |      17406338 |            24 |     870992 |       419 |        31586820 | 37177.879583
```
Thanks
On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
Whenever I check the checkpoint information in a log, most dirty pages are written by the checkpoint process
That's exactly how it should be!
Yours,
Laurenz Albe
On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:
On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
Whenever I check the checkpoint information in a log, most dirty pages
are written by the checkpoint process
That's exactly how it should be!
is it because if bgwriter frequently flushes, the disk io will be more?🤔
On 10/2/24 17:02, Tony Wayne wrote:
On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
Whenever I check the checkpoint information in a log, most dirty pages are written by the checkpoint process
That's exactly how it should be!
is it because if bgwriter frequently flushes, the disk io will be more?🤔
Yes, pretty much. But it's also about where the writes happen.
Checkpoint flushes dirty buffers only once per checkpoint interval,
which is the lowest amount of write I/O that needs to happen.
Every other way of flushing buffers is less efficient, and is mostly a
sign of memory pressure (shared buffers not large enough for active part
of the data).
But it also matters where the writes happen. Checkpoint does
that in the background, not as part of regular query execution. What we
don't want is for the user backends to flush buffers, because it's
expensive and can result in much higher latency.
The bgwriter is somewhere in between - it happens in the background,
but may not be as efficient as doing it in the checkpointer. Still much
better than having to do this in regular backends.
regards
--
Tomas Vondra
Hi Tomas,
Thank you for the explanation. If we don't change this static parameter,
could we, following the example of autovacuum_vacuum_cost_delay, lower the
minimum value of the bgwriter_delay parameter to 2ms? That would let the
bgwriter write more dirty pages and reduce the impact on performance when a
checkpoint occurs. After all, extending the checkpoint interval and the
crash-recovery time requires finding a balance; checkpoint_timeout and
max_wal_size cannot be increased indefinitely.
Thanks
Tomas Vondra <tomas@vondra.me> wrote on Thu, Oct 3, 2024 at 00:36:
Hi,
On 2024-10-02 18:36:44 +0200, Tomas Vondra wrote:
On 10/2/24 17:02, Tony Wayne wrote:
On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at
<mailto:laurenz.albe@cybertec.at>> wrote:On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
Whenever I check the checkpoint information in a log, most dirty
pages are written by the checkpoint process
That's exactly how it should be!
is it because if bgwriter frequently flushes, the disk io will be more?🤔
Yes, pretty much. But it's also about where the writes happen.
Checkpoint flushes dirty buffers only once per checkpoint interval,
which is the lowest amount of write I/O that needs to happen.
Every other way of flushing buffers is less efficient, and is mostly a
sign of memory pressure (shared buffers not large enough for active part
of the data).
It's implied, but to make it more explicit: One big efficiency advantage of
writes by checkpointer is that they are sorted and can often be combined into
larger writes. That's often a lot more efficient: For network attached storage
it saves you iops, for local SSDs it's much friendlier to wear leveling.
But it also matters where the writes happen. Checkpoint does
that in the background, not as part of regular query execution. What we
don't want is for the user backends to flush buffers, because it's
expensive and can result in much higher latency.
The bgwriter is somewhere in between - it happens in the background,
but may not be as efficient as doing it in the checkpointer. Still much
better than having to do this in regular backends.
Another aspect is that checkpointer's writes are much easier to pace over time
than e.g. bgwriter's, because bgwriter is triggered by a fairly short-term
signal. Eventually we'll want to combine writes by bgwriter too, but that's
always going to be more expensive than doing it in a large batched fashion
like checkpointer does.
I think we could improve checkpointer's pacing further, fwiw, by taking into
account that the WAL volume at the start of a spread-out checkpoint typically
is bigger than at the end.
Greetings,
Andres Freund
Hi Andres,
It's implied, but to make it more explicit: One big efficiency advantage of
writes by checkpointer is that they are sorted and can often be combined into
larger writes. That's often a lot more efficient: For network attached storage
it saves you iops, for local SSDs it's much friendlier to wear leveling.
Thank you for the explanation. I think the bgwriter can also merge I/O: it
writes asynchronously to the file system cache, and the OS schedules the
actual writes.
Another aspect is that checkpointer's writes are much easier to pace over time
than e.g. bgwriter's, because bgwriter is triggered by a fairly short-term
signal. Eventually we'll want to combine writes by bgwriter too, but that's
always going to be more expensive than doing it in a large batched fashion
like checkpointer does.
I think we could improve checkpointer's pacing further, fwiw, by taking into
account that the WAL volume at the start of a spread-out checkpoint typically
is bigger than at the end.
I'm also very keen to improve checkpoints. Whenever I run a stress test, the
bgwriter writes no dirty pages when the data set is smaller than
shared_buffers. Before the first checkpoint, the tps is stable and at its
highest for the entire run. Other databases flush dirty pages at a certain
frequency, at intervals, and at dirty-page watermarks, and they see a much
smaller performance impact when checkpoints occur.
Thanks
Andres Freund <andres@anarazel.de> wrote on Fri, Oct 4, 2024 at 03:40:
Hi,
On 2024-10-04 09:31:45 +0800, wenhui qiu wrote:
It's implied, but to make it more explicit: One big efficiency advantage of
writes by checkpointer is that they are sorted and can often be combined into
larger writes. That's often a lot more efficient: For network attached storage
it saves you iops, for local SSDs it's much friendlier to wear leveling.
Thank you for the explanation. I think the bgwriter can also merge I/O: it
writes asynchronously to the file system cache, and the OS schedules the
actual writes.
Because bgwriter writes are just ordered by their buffer id (further made less
sequential due to only writing out not-recently-used buffers), they are often
effectively random. The OS can't do much about that.
Another aspect is that checkpointer's writes are much easier to pace over time
than e.g. bgwriter's, because bgwriter is triggered by a fairly short-term
signal. Eventually we'll want to combine writes by bgwriter too, but that's
always going to be more expensive than doing it in a large batched fashion
like checkpointer does.
I think we could improve checkpointer's pacing further, fwiw, by taking into
account that the WAL volume at the start of a spread-out checkpoint typically
is bigger than at the end.
I'm also very keen to improve checkpoints. Whenever I run a stress test, the
bgwriter writes no dirty pages when the data set is smaller than
shared_buffers.
It *SHOULD NOT* do anything in that situation. There's absolutely nothing to
be gained by bgwriter writing in that case.
Before the first checkpoint, the tps is stable and at its highest for the
entire run. Other databases flush dirty pages at a certain frequency, at
intervals, and at dirty-page watermarks, and they see a much smaller
performance impact when checkpoints occur.
I doubt that slowdown is caused by bgwriter not being active enough. I suspect
what you're seeing is one or more of:
a) The overhead of doing full page writes (due to increasing the WAL
volume). You could verify whether that's the case by turning
full_page_writes off (but note that that's not generally safe!) or see if
the overhead shrinks if you set wal_compression=zstd or wal_compression=lz4
(don't use pglz, it's too slow).
b) The overhead of renaming WAL segments during recycling. You could see if
this is related by specifying --wal-segsize 512 or such during initdb.
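For reference, the two suggested experiments would look roughly like this (note: full_page_writes=off is for diagnosis only and is unsafe on storage without atomic 8kB writes; initdb's --wal-segsize takes megabytes):

```
# postgresql.conf: check whether full-page-write volume is the bottleneck
wal_compression = 'lz4'          # or 'zstd'; avoid pglz (too slow)
#full_page_writes = off          # diagnosis only -- not generally safe!

# at initdb time: larger WAL segments reduce recycling/rename overhead
# initdb --wal-segsize=512 -D $PGDATA
```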
Greetings,
Andres
Hi Andres,
Thank you for the explanation.
I doubt that slowdown is caused by bgwriter not being active enough. I
suspect what you're seeing is one or more of:
a) The overhead of doing full page writes (due to increasing the WAL
volume). You could verify whether that's the case by turning
full_page_writes off (but note that that's not generally safe!) or see if
the overhead shrinks if you set wal_compression=zstd or wal_compression=lz4
(don't use pglz, it's too slow).
b) The overhead of renaming WAL segments during recycling. You could see if
this is related by specifying --wal-segsize 512 or such during initdb.
I am aware of these optimizations, but they only mitigate the impact. I
deliberately did not turn on WAL compression during the stress test.
shared_buffers = '32GB'
bgwriter_delay = '10ms'
bgwriter_lru_maxpages = '8192'
bgwriter_lru_multiplier = '10.0'
wal_buffers = '64MB'
checkpoint_completion_target = '0.999'
checkpoint_timeout = '600'
max_wal_size = '32GB'
min_wal_size = '16GB'
I think that in workloads with many reads and few writes, it is indeed
desirable to keep as many dirty pages in memory as possible. However, in
workloads such as push systems and task-scheduling systems, which do a lot
of both reads and writes, the impact of checkpoints is more pronounced. An
adaptive bgwriter, or one triggered when dirty pages reach a certain
watermark, could eliminate the performance jitter caused by checkpoints.
From what I understand, quite a few commercial databases based on
PostgreSQL have added an adaptive dirty-page-flush feature, and according
to their internal reports the whole stress-testing process was very smooth.
Since it's a trade secret, I don't know how they implemented this feature.