Pre-allocating WAL files
Hi,
When running write heavy transactional workloads I've many times
observed that one needs to run the benchmarks for quite a while till
they get to their steady state performance. The most significant reason
for that is that initially WAL files will not get recycled, but need to
be freshly initialized. That's 16MB of writes that need to synchronously
finish before a small write transaction can even start to be written
out...
I think there's two useful things we could do:
1) Add pg_wal_preallocate(uint64 bytes) that ensures (bytes +
segment_size - 1) / segment_size WAL segments exist from the current
point in the WAL. Perhaps with the number of bytes defaulting to
min_wal_size if not explicitly specified?
2) Have checkpointer (we want walwriter to run with low latency to flush
out async commits etc) occasionally check if WAL files need to be
pre-allocated.
Checkpointer already tracks the amount of WAL that's expected to be
generated till the end of the checkpoint, so it seems like it's a
pretty good candidate to do so.
To keep checkpointer pre-allocating when idle we could signal it
whenever a record has crossed a segment boundary.
With a plain pgbench run I see a 2.5x reduction in throughput in the
periods where we initialize WAL files.
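
As a rough illustration of the arithmetic in proposal 1, here is a small Python sketch of the rounding. The function name segments_needed is made up for this example, and the 16MB segment size is just the default; nothing here is taken from an actual patch.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size (16MB)


def segments_needed(nbytes: int, segment_size: int = WAL_SEGMENT_SIZE) -> int:
    """Round nbytes up to a whole number of WAL segments (ceiling division)."""
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    return (nbytes + segment_size - 1) // segment_size


# 80MB (the default min_wal_size) is exactly 5 segments of 16MB;
# a single extra byte spills into a sixth segment.
print(segments_needed(80 * 1024 * 1024))      # 5
print(segments_needed(80 * 1024 * 1024 + 1))  # 6
```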
Greetings,
Andres Freund
On 12/25/20, 12:09 PM, "Andres Freund" <andres@anarazel.de> wrote:
> 2) Have checkpointer (we want walwriter to run with low latency to flush
> out async commits etc) occasionally check if WAL files need to be
> pre-allocated.
I've been exploring this independently a bit and noticed this message.
Attached is a proof-of-concept patch for a separate "WAL allocator"
process that maintains a pool of WAL-segment-sized files that can be
claimed whenever a new segment file is needed. An early version of
this patch attempted to spread the I/O like non-immediate checkpoints
do, but I couldn't point to any real benefit from doing so, and it
complicated things quite a bit.
I like the idea of trying to bake this into an existing process such
as the checkpointer. I'll admit that creating a new process just for
WAL pre-allocation feels a bit heavy-handed, but it was a nice way to
keep this stuff modularized. I can look into moving this
functionality into the checkpointer process if this is something that
folks are interested in.
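
To make the pool idea concrete, here is a toy Python sketch, not the code from the attached patch. The class name SegmentPool, the directory name, and the file naming are all hypothetical; the only point is that the expensive zero-fill and fsync happen when the pool is replenished, while claiming a segment is just a rename.

```python
import os
import tempfile


class SegmentPool:
    """Toy pool of zero-filled, segment-sized files (hypothetical names)."""

    def __init__(self, pool_dir: str, target: int,
                 segment_size: int = 16 * 1024 * 1024) -> None:
        self.pool_dir = pool_dir
        self.target = target
        self.segment_size = segment_size
        self.counter = 0
        os.makedirs(pool_dir, exist_ok=True)

    def replenish(self) -> int:
        """Create files until the pool holds `target` segments; return how many were made."""
        created = 0
        while len(os.listdir(self.pool_dir)) < self.target:
            path = os.path.join(self.pool_dir, f"prealloc.{self.counter}")
            self.counter += 1
            with open(path, "wb") as f:
                f.write(bytes(self.segment_size))  # zero-filled contents
                f.flush()
                os.fsync(f.fileno())  # the slow part, paid off the hot path
            created += 1
        return created

    def claim(self, dest: str) -> bool:
        """Move one pooled file to `dest`; return False if the pool is empty."""
        names = sorted(os.listdir(self.pool_dir))
        if not names:
            return False
        os.rename(os.path.join(self.pool_dir, names[0]), dest)
        return True


# Small segments here only to keep the demonstration quick.
with tempfile.TemporaryDirectory() as d:
    pool = SegmentPool(os.path.join(d, "pool"), target=2, segment_size=1024 * 1024)
    pool.replenish()
    claimed = pool.claim(os.path.join(d, "000000010000000000000001"))
    print("claimed:", claimed, "left in pool:", len(os.listdir(pool.pool_dir)))
```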
Nathan
Attachments:
v1-0001-wal-segment-pre-allocation.patch (application/octet-stream, +761 -108)
On Mon, Jun 7, 2021 at 8:48 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> Attached is a proof-of-concept patch for a separate "WAL allocator"
> process that maintains a pool of WAL-segment-sized files that can be
> claimed whenever a new segment file is needed.
Thanks for posting the patch. It no longer applies on HEAD:
Applying: wal segment pre-allocation
error: patch failed: src/backend/access/transam/xlog.c:3283
error: src/backend/access/transam/xlog.c: patch does not apply
Could you rebase the patch and post it? That might help if someone picks
it up for review.
Regards,
Vignesh
On 7/5/21, 9:52 AM, "vignesh C" <vignesh21@gmail.com> wrote:
> Could you rebase the patch and post it? That might help if someone picks
> it up for review.
I've attached a rebased version of the patch.
Nathan
Attachments:
v2-0001-wal-segment-pre-allocation.patch (application/octet-stream, +765 -112)
On 7/9/21, 2:10 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> I've attached a rebased version of the patch.
Here's a newer rebased version of the patch.
Nathan
Attachments:
v3-0001-wal-segment-pre-allocation.patch (application/octet-stream, +765 -112)
On 8/6/21, 1:27 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> Here's a newer rebased version of the patch.
Rebasing again to keep http://commitfest.cputube.org/ happy.
Nathan
Attachments:
v4-0001-wal-segment-pre-allocation.patch (application/octet-stream, +765 -112)
On 8/31/21, 10:27 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> Rebasing again to keep http://commitfest.cputube.org/ happy.
Another rebase.
Nathan
Attachments:
v5-0001-wal-segment-pre-allocation.patch (application/octet-stream, +765 -112)
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, failed
Spec compliant: not tested
Documentation: not tested
Hi!
We've looked through the code and everything looks good except for a few minor things:
1) Using a dedicated bg worker does not seem optimal; it introduces a lot of redundant code for a single small action.
We'd join Andres's initial proposal to implement it as an extra function of the checkpointer.
2) In our view, it is better to move #define PREALLOCSEGDIR outside the function body.
3) It would be good to have at least short comments on the functions GetNumPreallocatedWalSegs and SetNumPreallocatedWalSegs.
We've also tested the performance difference between the master branch and this patch and noticed no significant difference.
We used pgbench with some sort of "standard" settings:
$ pgbench -c50 -s50 -T200 -P1 -r postgres
...and with...
$ pgbench -c100 -s50 -T200 -P1 -r postgres
Looking at the per-second output of pgbench, we saw regular latency spikes (around a 5-10x increase), and this pattern was similar with and without the patch. The overall average latency over 200 seconds of pgbench also looks much the same with and without the patch. Could you share your testing setup so that we can see the effect, please?
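
For anyone who wants to compare runs the same way, the following Python sketch flags the seconds where latency jumps well above the median. It assumes the "progress: ... tps, lat ... ms" line shape that pgbench prints with -P1; the sample lines are invented for illustration and are not measurements from this thread.

```python
import re
import statistics
from typing import Iterable, List, Tuple

# Matches the leading part of a pgbench -P progress line (assumed format).
PROGRESS_RE = re.compile(r"progress: ([\d.]+) s, ([\d.]+) tps, lat ([\d.]+) ms")


def find_latency_spikes(lines: Iterable[str], factor: float = 5.0) -> List[Tuple[float, float]]:
    """Return (elapsed_seconds, latency_ms) for lines whose latency exceeds factor * median."""
    samples = []
    for line in lines:
        m = PROGRESS_RE.search(line)
        if m:
            samples.append((float(m.group(1)), float(m.group(3))))
    if not samples:
        return []
    baseline = statistics.median(lat for _, lat in samples)
    return [(t, lat) for t, lat in samples if lat > factor * baseline]


# Hypothetical sample output, made up for this example.
sample = [
    "progress: 1.0 s, 5200.1 tps, lat 9.6 ms stddev 3.1",
    "progress: 2.0 s, 5100.4 tps, lat 9.8 ms stddev 3.3",
    "progress: 3.0 s, 900.7 tps, lat 55.2 ms stddev 40.9",
    "progress: 4.0 s, 5150.9 tps, lat 9.7 ms stddev 3.0",
]
print(find_latency_spikes(sample))  # [(3.0, 55.2)]
```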
check-world was successful.
Overall the patch looks good, but in our view it would be better to have experimental evidence of the performance improvement before it is committed.
---
Best regards,
Maxim Orlov, Pavel Borisov.
The new status of this patch is: Waiting on Author
On 10/6/21, 5:20 AM, "Maxim Orlov" <m.orlov@postgrespro.ru> wrote:
> We've looked through the code and everything looks good except for a few minor things:
> 1) Using a dedicated bg worker does not seem optimal; it introduces a lot of redundant code for a single small action.
> We'd join Andres's initial proposal to implement it as an extra function of the checkpointer.
Thanks for taking a look!
I have been thinking about the right place to put this logic. My
first thought is that it sounds like something that ought to go in the
WAL writer process, but as Andres noted upthread, that is undesirable
because it'll add latency for asynchronous commits. The checkpointer
process seems to be another candidate, but I'm not totally sure how
it'll fit in. My patch works by maintaining a small pool of
pre-allocated segments that is quickly replenished whenever one is used.
If the checkpointer is spending most of its time checkpointing, this
small pool could remain empty for long periods of time. To keep
pre-allocating WAL while we're checkpointing, perhaps we could add another
function like CheckpointWriteDelay() that is called periodically.
There still might be several operations in CheckPointGuts() that take
a while and leave the segment pool empty, but maybe that's okay for
now.
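
The concern about the pool running dry during a checkpoint can be shown with a toy Python simulation. Everything in it is hypothetical (the function name, the tick model, the numbers); it only models a checkpoint as a series of buffer writes with a periodic hook point, in the spirit of CheckpointWriteDelay(), where the pool may be topped up.

```python
def simulate(num_buffers: int, claim_every: int, pool_target: int,
             preallocate_during_checkpoint: bool) -> int:
    """Count how often a backend finds the pool empty while a checkpoint runs.

    Each loop iteration stands for one buffer written by the checkpointer,
    followed by its periodic delay point. Every `claim_every` iterations a
    backend needs a new segment.
    """
    available = pool_target
    starved = 0
    for i in range(1, num_buffers + 1):
        # Delay point: optionally create one segment if the pool is short.
        if preallocate_during_checkpoint and available < pool_target:
            available += 1
        # A backend crosses a segment boundary and wants a pooled file.
        if i % claim_every == 0:
            if available > 0:
                available -= 1
            else:
                starved += 1  # would have to initialize the segment itself
    return starved


print(simulate(100, 10, 2, preallocate_during_checkpoint=False))  # 8
print(simulate(100, 10, 2, preallocate_during_checkpoint=True))   # 0
```

Without the hook, the two pooled segments are used up early and every later request falls back to initializing a segment itself; with it, the pool is refilled between requests.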
Nathan
On 10/6/21, 9:34 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> To keep pre-allocating WAL while we're checkpointing, perhaps we could
> add another function like CheckpointWriteDelay() that is called
> periodically.
Here is a first attempt at adding the pre-allocation logic to the
checkpointer. I went ahead and just used CheckpointWriteDelay() for
pre-allocating during checkpoints. I've done a few pgbench runs, and
it seems to work pretty well. Initialization is around 15% faster,
and I'm seeing about a 5% increase in TPS with a simple-update
workload with wal_recycle turned off. Of course, these improvements
go away once segments can be recycled.
Nathan
Attachments:
v6-0001-Move-WAL-segment-creation-logic-to-its-own-functi.patch (application/octet-stream, +116 -103)
v6-0002-WAL-segment-pre-allocation.patch (application/octet-stream, +417 -21)
On 10/8/21, 1:55 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> Here is a first attempt at adding the pre-allocation logic to the
> checkpointer.
Here is a rebased version of this patch set. I'm getting the sense
that there isn't a whole lot of interest in this feature, so I'll
likely withdraw it if it goes too much longer without traction.
Nathan
Attachments:
v7-0001-Move-WAL-segment-creation-logic-to-its-own-functi.patch (application/octet-stream, +116 -103)
v7-0002-WAL-segment-pre-allocation.patch (application/octet-stream, +417 -21)
On Thu, Nov 11, 2021 at 12:29 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> Here is a rebased version of this patch set. I'm getting the sense
> that there isn't a whole lot of interest in this feature, so I'll
> likely withdraw it if it goes too much longer without traction.
As I mentioned in the other thread at [1], let's continue the discussion here.
Why can't the walwriter pre-allocate some of the WAL segments instead
of a new background process? Of course, it might delay the main
functionality of the walwriter i.e. flush and sync the WAL files, but
having checkpointer do the pre-allocation makes it do another extra
task. I'm not sure which of the two, walwriter or checkpointer, does
more work.
Another idea could be to let whichever of walwriter or checkpointer
seems free at the moment pre-allocate the WAL files when the WAL
segment pre-allocation request comes. We could go further and let the
user choose which process, i.e. checkpointer or walwriter, does the
pre-allocation, with a GUC maybe?
[1]: /messages/by-id/CALj2ACVqYJX9JugooRC1chb2sHqv-C9mYEBE1kxwn+Tn9vY42A@mail.gmail.com
Regards,
Bharath Rupireddy.
On 12/7/21, 12:29 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> Why can't the walwriter pre-allocate some of the WAL segments instead
> of a new background process? Of course, it might delay the main
> functionality of the walwriter i.e. flush and sync the WAL files, but
> having checkpointer do the pre-allocation makes it do another extra
> task.
The argument against adding it to the WAL writer is that we want it to
run with low latency to flush asynchronous commits. If we added WAL
pre-allocation to the WAL writer, there could periodically be large
delays.
> Another idea could be to let walwriter or checkpointer pre-allocate
> the WAL files whichever seems free as-of-the-moment when the WAL
> segment pre-allocation request comes. We can go further to let the
> user choose which process i.e. checkpointer or walwriter do the
> pre-allocation with a GUC maybe?
My latest patch set [0] adds WAL pre-allocation to the checkpointer.
In that patch set, WAL pre-allocation is done both outside of
checkpoints as well as during checkpoints (via
CheckpointWriteDelay()).
Nathan
[0]: /messages/by-id/CB15BEBD-98FC-4E72-86AE-513D34014176@amazon.com
On 12/7/21, 9:35 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> The argument against adding it to the WAL writer is that we want it to
> run with low latency to flush asynchronous commits. If we added WAL
> pre-allocation to the WAL writer, there could periodically be large
> delays.
To your point on trying to avoid giving the checkpointer extra tasks
(basically what we are talking about on the other thread [0]), WAL
pre-allocation might not be of much concern because it will generally
be a small, fixed (and configurable) amount of work, and it can be
performed concurrently with the checkpoint. Plus, WAL pre-allocation
should ordinarily be phased out as WAL segments become eligible for
recycling. IMO it's not comparable to tasks like
CheckPointSnapBuild() that can delay checkpointing for a long time.
Nathan
[0]: /messages/by-id/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB@amazon.com
> pre-allocating during checkpoints. I've done a few pgbench runs, and
> it seems to work pretty well. Initialization is around 15% faster,
> and I'm seeing about a 5% increase in TPS with a simple-update
> workload with wal_recycle turned off. Of course, these improvements
> go away once segments can be recycled.
I've checked the patch v7. It applies cleanly, code is good, check-world
tests passed without problems.
I think it's ok to use checkpointer for this and the overall patch can be
committed. But the observed performance gain makes me think twice before adding
this feature. I did tests myself a couple of months ago and got similar
results.
I really don't know whether it is worth the effort.
Wishing you and all hackers a happy New Year!
--
Best regards,
Pavel Borisov
Postgres Professional: http://postgrespro.com <http://www.postgrespro.com>
I checked the patch too and found it to be ok. Both check and check-world
passed.
The overall idea seems good in my opinion, but I'm not sure where the
optimal place to put the pre-allocation is.
On Thu, Dec 30, 2021 at 2:46 PM Pavel Borisov <pashkin.elfe@gmail.com>
wrote:
> I've checked the patch v7. It applies cleanly, code is good, check-world
> tests passed without problems.
---
Best regards,
Maxim Orlov.
On 12/30/21, 3:52 AM, "Maxim Orlov" <orlovmg@gmail.com> wrote:
> I did check the patch too and found it to be ok. Check and check-world are passed.
Thank you both for your review.
Nathan
On Thu, Jan 6, 2022 at 3:39 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> Thank you both for your review.
This may have been discussed earlier, but let me ask it here. IIUC, the
whole point of pre-allocating WAL files is that creating a new WAL file
of wal_segment_size requires us to write zero-filled pages to
disk, which is costly. With fallocate/posix_fallocate available,
isn't file allocation going to be much
faster on platforms where fallocate is supported? IIRC, the
"Asynchronous and "direct" IO support for PostgreSQL." thread has a way to
use fallocate. If we do move ahead and use fallocate, doesn't
pre-allocating WAL files become unnecessary?
Having said that, the idea of pre-allocating WAL files is still
relevant, given the portability concerns with fallocate/posix_fallocate.
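
To show the two allocation strategies side by side, here is a minimal Python sketch. The helper names are hypothetical and this is not how PostgreSQL creates segments; it only contrasts writing zero-filled blocks with asking the OS to reserve the space via os.posix_fallocate, falling back to zero-fill where that call is missing or fails.

```python
import os
import tempfile

SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default wal_segment_size


def allocate_zero_fill(path: str, size: int = SEGMENT_SIZE, block: int = 8192) -> None:
    """Create the file by writing zeroed blocks (size is assumed to be a multiple of block)."""
    zeros = bytes(block)
    with open(path, "wb") as f:
        for _ in range(size // block):
            f.write(zeros)
        f.flush()
        os.fsync(f.fileno())


def allocate_fallocate(path: str, size: int = SEGMENT_SIZE) -> bool:
    """Reserve space with posix_fallocate; return False if we had to fall back to zero-fill."""
    if not hasattr(os, "posix_fallocate"):
        allocate_zero_fill(path, size)
        return False
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        try:
            os.posix_fallocate(fd, 0, size)
        except OSError:
            # Not supported on this platform or filesystem: use the portable path.
            os.close(fd)
            fd = -1
            allocate_zero_fill(path, size)
            return False
        os.fsync(fd)
        return True
    finally:
        if fd != -1:
            os.close(fd)


with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "zero_filled")
    b = os.path.join(d, "fallocated")
    allocate_zero_fill(a)
    used_fallocate = allocate_fallocate(b)
    print(os.path.getsize(a), os.path.getsize(b), "fallocate used:", used_fallocate)
```

Both files end up with the same length; whether the second path is actually cheaper depends on the platform and filesystem.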
Regards,
Bharath Rupireddy.
On Sat, Jan 15, 2022 at 1:36 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Having said that, the idea of pre-allocating WAL files is still
> relevant, given the portability concerns with fallocate/posix_fallocate.
One more point: do we have any numbers on how much total time
WAL file allocation usually takes, for example on a server under a
high write load?
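
One crude way to get a first number is to time the creation of a segment-sized, zero-filled file including its fsync. The Python sketch below does only that; the function name is hypothetical, it runs on an otherwise idle temporary directory, and it says nothing about behaviour under a real write-heavy load.

```python
import os
import tempfile
import time
from typing import List

SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default wal_segment_size


def time_segment_init(directory: str, size: int = SEGMENT_SIZE,
                      rounds: int = 3) -> List[float]:
    """Return the seconds taken to zero-fill and fsync a file of `size` bytes, per round."""
    timings = []
    for i in range(rounds):
        path = os.path.join(directory, f"segment.{i}")
        start = time.perf_counter()
        with open(path, "wb") as f:
            f.write(bytes(size))  # zero-filled buffer
            f.flush()
            os.fsync(f.fileno())
        timings.append(time.perf_counter() - start)
        os.unlink(path)  # keep disk usage flat between rounds
    return timings


with tempfile.TemporaryDirectory() as d:
    for seconds in time_segment_init(d):
        print(f"{seconds * 1000:.1f} ms")
```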
Regards,
Bharath Rupireddy.
On Thu, Dec 30, 2021 at 02:51:10PM +0300, Maxim Orlov wrote:
> I did check the patch too and found it to be ok. Check and check-world are
> passed.
FYI: this is currently failing in cfbot on linux.
https://cirrus-ci.com/task/4934371210690560
https://api.cirrus-ci.com/v1/artifact/task/4934371210690560/log/src/test/regress/regression.diffs
DROP TABLESPACE regress_tblspace_renamed;
+ERROR: tablespace "regress_tblspace_renamed" is not empty
--
Justin