[PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Started by 陈宗志 · 25 days ago · 14 messages
#1 陈宗志
baotiao@gmail.com

Hi hackers,

I raised this topic a while back [1] but didn't get much traction, so
I went ahead and implemented it: a doublewrite buffer (DWB) mechanism
for PostgreSQL as an alternative to full_page_writes.

The core argument is straightforward. FPW and checkpoint frequency are
fundamentally at odds:

- FPW wants fewer checkpoints -- each checkpoint triggers a wave of
full-page WAL writes, one for every page dirtied for the first time since
that checkpoint, bloating WAL and tanking write throughput.
- Fast crash recovery wants more checkpoints -- less WAL to replay
means the database comes back sooner.

DWB resolves this tension by moving torn page protection out of the
WAL path entirely. Instead of writing full pages into WAL (foreground,
latency-sensitive), dirty pages are sequentially written to a
dedicated doublewrite buffer area on disk before being flushed to
their actual locations. The buffer is fsync'd once when full, then
pages are scatter-written to their final positions. On crash recovery,
intact copies from the DWB repair any torn pages.

Key design differences:

- FPW: 1 WAL write (foreground) + 1 page write = directly impacts SQL latency
- DWB: 2 page writes (background flush path) = minimal user-visible impact
- DWB batches fsync() across multiple pages; WAL fsync batching is
limited by foreground latency constraints
- DWB decouples torn page protection from checkpoint frequency, so you
can checkpoint as often as you want without write amplification

I ran sysbench benchmarks (io-bound, --tables=10
--table_size=10000000) with checkpoint_timeout=30s,
shared_buffers=4GB, synchronous_commit=on. Each scenario uses a fresh
database, VACUUM FULL, 60s warmup, 300s run.

Results (TPS):

                  FPW OFF    FPW ON    DWB ON
read_write/32      18,038     7,943    13,009
read_write/64      24,249     9,533    15,387
read_write/128     27,801     9,715    15,387
write_only/32      53,146    18,116    31,460
write_only/64      57,628    19,589    32,875
write_only/128     59,454    14,857    33,814

Avg latency (ms):

                  FPW OFF    FPW ON    DWB ON
read_write/32        1.77      4.03      2.46
read_write/64        2.64      6.71      4.16
read_write/128       4.60     13.17      9.81
write_only/32        0.60      1.77      1.02
write_only/64        1.11      3.27      1.95
write_only/128       2.15      8.61      3.78

With FPW ON, throughput drops to as low as ~25% of the baseline (FPW OFF);
DWB ON holds at roughly 55-70%. In the write-only scenarios DWB delivers
1.7-2.3x the throughput of FPW with significantly better latency.

The implementation is here: https://github.com/baotiao/postgres

I'd appreciate any feedback on the approach. It would be great if the
community could take a look and see whether this direction is worth
pursuing upstream.

Thanks,
Baotiao

[1]: /messages/by-id/CAGbZs7hbJeUe7xY4QD25QW6VSnNFk1e3cwbCa8_R+2+YnoYRKw@mail.gmail.com

#2 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: 陈宗志 (#1)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

On Mon, Feb 9, 2026 at 7:53 PM 陈宗志 <baotiao@gmail.com> wrote:


Hi Baotiao

I'm a newbie here, but I took your idea up with some interest; everyone
else is probably busy with other patches before the commit freeze.

I think it would be valuable to have this, as I've been hit by PostgreSQL's
unsteady (chain-saw-like) WAL traffic, especially from pages being touched
for the first time after a checkpoint, up to the point of saturating network
links. The common counter-argument to doublewrite buffering is probably that
FPIs may(?) improve the WAL replay rate on standbys, and that would have to
be measured too (but we should also take into account how much the
maintenance_io_concurrency/posix_fadvise() prefetching we do today helps
avoid I/O stalls on fetching pages, so it should be basically free). I see
that you even got benefits by not using FPIs. Interesting.

Some notes/questions about the patches themselves:

0. The convention here is to send the patches using:
git format-patch -v<VERSION> HEAD~<numberOfpatches>
for easier review. The 0003 patch should probably be out of scope. Anyway,
I've attached all of them so that somebody else can take a look too; they
look very mature. Is this code already used in production anywhere? (And
BTW, the numbers are quite impressive.)

1. We have full_page_writes = on/off, but your patch adds double_write_buffer.
IMHO, if we have competing solutions it would be better to have something like
io_torn_pages_protection = off | full_pages | double_writes
and maybe we'll be able to add 'atomic_writes' one day.
BTW: once you stabilize the GUC, it is worth adding it to postgresql.conf.sample.

2. How would one know how to size double_write_buffer_size?

2b. IMHO the patch could also enrich pg_stat_io with some information.
Please take a look at the pg_stat_io view and functions like
pgstat_count_io_op_time(), their parameters, and the enums there; that way
we could perhaps have IOOBJECT_DWBUF and be able to say how much I/O was
attributed to doublewriting, the fsync() times related to it, and so on.

3. In DWBufPostCheckpoint() there's a pg_usleep(1ms) just before the atomic
pwrites(); why exactly is it necessary to have what is literally a
sched_yield(2) there?

4. In BufferSync() I have doubts about whether such copying is safe in a loop:
    page = BufHdrGetBlock(bufHdr);
    memcpy(dwb_buf, page, BLCKSZ);
Shouldn't there be some form of locking (BUFFER_LOCK_SHARE?) and pinning of
the buffer? Also, wouldn't it be better if that memcpy were guarded by a
critical section (START_CRIT_SECTION)?

4b. There seems to be double copying: there's a palloc for dwb_buf in
BufferSync() that is filled by memcpy(), and then DWBufWritePage() is called
and that page is copied a second time with another memcpy(). This seems to
happen for every checkpointed page, so it may reduce the benefits of this
doublewrite code.

4c. Shouldn't the active waiting in DWBufWritePage() be done with spinlocks
rather than pg_usleep(100us)?

5. Have you verified, using injection points (or gdb), that crashing in
several places really hits DWBufRecoverPage()? Is there a simple way of
reproducing this to play with it? (That could possibly make a good test on
its own.)

6. Quick testing overview (for completeness):
- A basic test without even enabling this feature complains about
postgresql.conf.sample (test_misc/003_check_guc).
- With `PG_TEST_INITDB_EXTRA_OPTS="-c double_write_buffer=on" meson test`
I got 3 failures:
* test_misc/003_check_guc (expected)
* pg_waldump/002_save_fullpage (I would say it's expected)
* pg_walinspect / pg_walinspect/regress (I would say it's expected)

I haven't really got it up and running for real, but at least it's a start
and I hope it helps.

-J.

Attachments:

- v1-0001-Add-double-write-buffer-DWB-for-torn-page-protect.patch (+920/-11)
- v1-0002-Fix-DWB-process-handling-and-skip-FPW-when-DWB-en.patch (+87/-28)
- v1-0003-Add-Claude-Code-configuration.patch (+25/-1)
- v1-0004-Fix-critical-correctness-bugs-in-double-write-buf.patch (+223/-54)
#3 Robert Treat
xzilla@users.sourceforge.net
In reply to: Jakub Wartak (#2)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

On Mon, Feb 16, 2026 at 9:07 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:


Hi Baotiao

I'm a newbie here, but I took your idea up with some interest; everyone
else is probably busy with other patches before the commit freeze.

I'm somewhat less of a noob here, so I'll confirm that this proposal has
basically zero chance of getting in, at least for the v19 cycle. This isn't
so much about the proposal itself; it's more that if you were trying to pick
the worst time of year to submit a large, complicated feature into the
PostgreSQL workflow, this would be really close to it.

However, I have also wondered about this specific trade-off (FPW vs
DWB) for years, but until now, the level of effort required to produce
a meaningful POC that would confirm if the idea was worth pursuing was
so large that I think it stopped anyone from even trying. So,
hopefully everyone will realize that we don't live in that world
anymore, and as a side benefit, apparently the idea is worth pursuing.

I think it would be valuable to have this, as I've been hit by PostgreSQL's
unsteady (chain-saw-like) WAL traffic, especially from pages being touched
for the first time after a checkpoint, up to the point of saturating network
links. The common counter-argument to doublewrite buffering is probably that
FPIs may(?) improve the WAL replay rate on standbys, and that would have to
be measured too (but we should also take into account how much the
maintenance_io_concurrency/posix_fadvise() prefetching we do today helps
avoid I/O stalls on fetching pages, so it should be basically free). I see
that you even got benefits by not using FPIs. Interesting.

Some notes/questions about the patches themselves:

So, I haven't looked at the code itself; tbh I am a bit too paranoid to
dive into generated code that would seem to carry some likely level of
legal risk around potential reuse of GPL/proprietary code it might be based
on (whether in its original training, inference, or the context used for
generation -- yeah, I know InnoDB isn't written in C, but still). That
said, I did have some feedback and questions on the proposal itself, and
some suggestions for how to move things forward.

It would be helpful if you could provide a little more information on the
system you ran these benchmarks on -- specifically, for me, the underlying
OS/filesystem/hardware, and I'd even be interested in the build flags. I'd
also be interested to know whether you did any kind of crash safety
testing... while it is great to have improved performance, presumably that
isn't actually the primary point of these subsystems. It'd also be worth
knowing whether you tested this on any systems with replication (physical
or logical), since we'd need to understand the potential downstream
effects. I'm tempted to say you should have an AI generate some pgbench
scripts. Granted, it's early, and fine if you haven't done any of this, but
I imagine we'll need to look at it eventually.

0. The convention here is to send the patches using:
git format-patch -v<VERSION> HEAD~<numberOfpatches>
for easier review. The 0003 patch should probably be out of scope. Anyway,
I've attached all of them so that somebody else can take a look too; they
look very mature. Is this code already used in production anywhere? (And
BTW, the numbers are quite impressive.)

While Jakub is right that the convention is to send patches, that
convention is based on a manual development model, not an agentic
development model. While there is no official project policy on this, IMHO
the thing we really need from you is not the code output, but the prompts
that were used to generate the code. There are plenty of folks who have
access to Claude and could then use those prompts to "recreate with enough
proximity" the work you had Claude do, and that process would also allow
for additional verification and a reduction of any legal concerns, or
concerns about investing further human time/energy. (No offense, but as you
are not a regular contributor, you could analogize this to when third
parties do large code dumps and say "here's a contribution, it's up to you
to figure out how to use it". Ideally we want other folks to be able to
pick up the project and continue with it, even if that means recreating it,
and that works best if we have the underlying prompts.)

The claude code configuration file is a good start, but certainly not
enough. Probably the ideal here would be full session logs, although a
developer-diary would probably also suffice. I'm kind of guessing here
because I don't know the scope of the prompts involved or how you were
interacting with Claude in order to get where you are now, but those
seem like the more obvious tools for work of this size whose intention
is to be open.

Robert Treat
https://xzilla.net

#4 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Robert Treat (#3)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Hi Robert,

In fact, this was discussed over a decade ago
(/messages/by-id/1962493974.656458.1327703514780.JavaMail.root@zimbra-prod-mbox-4.vmware.com).
In practice, the interest mainly stems from the significant overhead
introduced by FPW. The community has adopted various approaches to mitigate
its impact, including compressing FPWs and extending checkpoint intervals.
Nowadays, waiting for the primary to recover instead of performing an HA
switchover is generally considered unacceptable.

Thanks

On Thu, Feb 19, 2026 at 2:00 AM Robert Treat <rob@xzilla.net> wrote:


#5 陈宗志
baotiao@gmail.com
In reply to: Jakub Wartak (#2)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Hi Jakub

Thanks for your detailed suggestions. Sorry for the delayed response
due to the Chinese Spring Festival.

Overall, my current work on this is mostly a Proof of Concept (POC). I
believe that introducing a Double Write Buffer (DWB) is a more
reasonable design at this point, so I wanted to bring this up for
discussion with the community. At present, I am still modifying the
code, and it has not been deployed to production yet. We are currently
reviewing this code, and we plan to deploy and release it on our
Alibaba Cloud RDS for PostgreSQL in the future.

I'm a newbie here, but I took your idea up with some interest; everyone
else is probably busy with other patches before the commit freeze.

I think it would be valuable to have this, as I've been hit by PostgreSQL's
unsteady (chain-saw-like) WAL traffic, especially from pages being touched
for the first time after a checkpoint, up to the point of saturating network
links. The common counter-argument to doublewrite buffering is probably that
FPIs may(?) improve the WAL replay rate on standbys, and that would have to
be measured too (but we should also take into account how much the
maintenance_io_concurrency/posix_fadvise() prefetching we do today helps
avoid I/O stalls on fetching pages, so it should be basically free). I see
that you even got benefits by not using FPIs. Interesting.

Yes, I believe that using a DWB can reduce the WAL size, which is also
helpful for PostgreSQL's replication synchronization latency.

Some notes/questions about the patches themselves:

0. The convention here is to send the patches using:
git format-patch -v<VERSION> HEAD~<numberOfpatches>
for easier review. The 0003 patch should probably be out of scope. Anyway,
I've attached all of them so that somebody else can take a look too; they
look very mature. Is this code already used in production anywhere? (And
BTW, the numbers are quite impressive.)

Since my initial goal was just to start a discussion and provide some
experimental data, I didn't organize a proper patch series for this
mail thread.

As mentioned, this code is not yet used in production. However, we are
reviewing it and will push it to be used in Alibaba Cloud RDS for
PostgreSQL later (I am the manager of the Alibaba Cloud RDS Team).

1. We have full_page_writes = on/off, but your patch adds double_write_buffer.
IMHO, if we have competing solutions it would be better to have something like
io_torn_pages_protection = off | full_pages | double_writes
and maybe we'll be able to add 'atomic_writes' one day.
BTW: once you stabilize the GUC, it is worth adding it to postgresql.conf.sample.

That is a great idea, I will modify it accordingly later. I omitted
this initially because I was mainly focusing on providing experimental
data for the POC.

2. How would one know how to size double_write_buffer_size?

If we decide to merge this into upstream, I will provide a detailed
design document covering the internal mechanisms, sizing guidelines,
etc.

2b. IMHO the patch could also enrich pg_stat_io with some information.
Please take a look at the pg_stat_io view and functions like
pgstat_count_io_op_time(), their parameters, and the enums there; that way
we could perhaps have IOOBJECT_DWBUF and be able to say how much I/O was
attributed to doublewriting, the fsync() times related to it, and so on.

Agreed, these metrics need to be provided as well. I haven't modified
that part yet for this POC.

3. In DWBufPostCheckpoint() there's a pg_usleep(1ms) just before the atomic
pwrites(); why exactly is it necessary to have what is literally a
sched_yield(2) there?

Good catch, this will be revised in the next version.

4. In BufferSync() I have doubts about whether such copying is safe in a loop:
    page = BufHdrGetBlock(bufHdr);
    memcpy(dwb_buf, page, BLCKSZ);
Shouldn't there be some form of locking (BUFFER_LOCK_SHARE?) and pinning of
the buffer? Also, wouldn't it be better if that memcpy were guarded by a
critical section (START_CRIT_SECTION)?

Yes, this needs to be fixed.

4b. There seems to be double copying: there's a palloc for dwb_buf in
BufferSync() that is filled by memcpy(), and then DWBufWritePage() is called
and that page is copied a second time with another memcpy(). This seems to
happen for every checkpointed page, so it may reduce the benefits of this
doublewrite code.

Yes, this also needs to be fixed to avoid the double copying.

4c. Shouldn't the active waiting in DWBufWritePage() be done with spinlocks
rather than pg_usleep(100us)?

Yes, that is doable. This part hasn't been carefully refined yet.

5. Have you verified, using injection points (or gdb), that crashing in
several places really hits DWBufRecoverPage()? Is there a simple way of
reproducing this to play with it? (That could possibly make a good test on
its own.)

OK, I will work on providing a way to test this using injection points
in subsequent updates.

6. Quick testing overview (for completeness):
- A basic test without even enabling this feature complains about
postgresql.conf.sample (test_misc/003_check_guc).
- With `PG_TEST_INITDB_EXTRA_OPTS="-c double_write_buffer=on" meson test`
I got 3 failures:
* test_misc/003_check_guc (expected)
* pg_waldump/002_save_fullpage (I would say it's expected)
* pg_walinspect / pg_walinspect/regress (I would say it's expected)

OK, I will address these test failures in the next version.

I haven't really got it up and running for real, but at least it's a start
and I hope it helps.

Thanks again for the review.

Regards,
Baotiao

#6 陈宗志
baotiao@gmail.com
In reply to: Robert Treat (#3)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Hi Robert,

Thanks for the feedback and suggestions.

I'm somewhat less of a noob here, so I'll confirm that this proposal has
basically zero chance of getting in, at least for the v19 cycle. This isn't
so much about the proposal itself; it's more that if you were trying to pick
the worst time of year to submit a large, complicated feature into the
PostgreSQL workflow, this would be really close to it.
However, I have also wondered about this specific trade-off (FPW vs
DWB) for years, but until now, the level of effort required to produce
a meaningful POC that would confirm if the idea was worth pursuing was
so large that I think it stopped anyone from even trying. So,
hopefully everyone will realize that we don't live in that world
anymore, and as a side benefit, apparently the idea is worth pursuing.

I completely understand, and I actually have no intention of pushing
this patch for the v19 cycle. My primary goal right now is simply to
share the POC results and discuss the idea with the community to see
if this direction is worth pursuing.

For context, I have been a MySQL InnoDB developer for over 10 years,
but I admit I am a newcomer to the PostgreSQL community, so I am still
familiarizing myself with the standard workflow and processes here.

I think it would be valuable to have this, as I've been hit by PostgreSQL's
unsteady (chain-saw-like) WAL traffic, especially related to pages being
touched for the first time after a checkpoint, up to the point of saturating
network links. The common counter-argument to double buffering is probably
that FPI may(?) increase WAL standby replication rate, and this would have
to be taken into account (but we should also consider how much the
maintenance_io_concurrency/posix_fadvise() prefetching we do today helps
avoid I/O stalls on fetching pages - so it should be basically free). I see
that you even got benefits by not using FPI. Interesting.

Some notes/questions about the patches themselves:

So, I haven't looked at the code itself; tbh I am a bit too
paranoid to dive into generated code that would seem to carry some
likely level of legal risk around potential reuse of GPL/proprietary
code it might be based on (either in its original training, inference,
or context used for generation. Yeah, I know innodb isn't written in
C, but still). That said, I did have some feedback and questions on
the proposal itself, and some suggestions for how to move things
forward.

0. The convention here is to send patches using:
git format-patch -v<VERSION> HEAD~<numberOfpatches>
for easier review. The 0003 probably should be out of scope. Anyway, I've
attached all of those so maybe somebody else is going to take a look at
them too; they look very mature. Is this code used in production already
anywhere? (and BTW the numbers are quite impressive)

While Jakub is right that the convention is to send patches, that
convention is based on a manual development model, not an agentic
development model. While there is no official project policy on this,
IMHO the thing we really need from you is not the code output, but the
prompts that were used to generate the code. There are plenty of folks
who have access to claude that could then use those prompts to
"recreate with enough proximity" the work you had claude do, and that
process would also allow for additional verification and reduction of
any legal concerns or concerns about investing further human
time/energy. (No offense, but as you are not a regular contributor,
you could analogize this to when third parties do large code dumps and
say "here's a contribution, it's up to you to figure out how to use
it". Ideally we want other folks to be able to pick up the project and
continue with it, even if it means recreating it, and that works best
if we have the underlying prompts).
The Claude Code configuration file is a good start, but certainly not
enough. Probably the ideal here would be full session logs, although a
developer-diary would probably also suffice. I'm kind of guessing here
because I don't know the scope of the prompts involved or how you were
interacting with Claude in order to get where you are now, but those
seem like the more obvious tools for work of this size whose intention
is to be open.

Regarding the AI-generated code, the raw output from Claude was far
from perfect. I have manually reviewed and modified the code
extensively to get it to this state.

Our plan is to first deploy and test this thoroughly on our own
product, Alibaba Cloud RDS for PostgreSQL. Once we are confident that
it is stable and issue-free, we intend to submit a formalized patch
to the community. I am very much looking forward to discussing and
reviewing the actual code with you all when the time comes.

As for sharing the prompts or session logs, I personally feel they
might not be as valuable as the final code itself. The generation
process involved a lot of iterative, back-and-forth communication;
the AI only knew how to make the right modifications after continuous
human guidance, correction, and architectural decisions.

It would be helpful if you could provide a little more information on
the system you are running these benchmarks on, specifically for me
the underlying OS/Filesystem/hardware, and I'd even be interested in
the build flags. I'd also be interested to know if you did any kind of
crash safety testing... while it is great to have improved
performance, presumably that isn't actually the primary point of these
subsystems. It'd also be worth knowing if you tested this on any
systems with replication (physical or logical) since we'd need to
understand those potential downstream effects. I'm tempted to say you
should have an AI generate some pgbench scripts. Granted it's early and
fine if you haven't done any of this, but I imagine we'll need to look at
it eventually.

I have addressed the feedback and conducted comprehensive benchmarks
comparing the three io_torn_pages_protection modes. Here are the
detailed performance results and the system setup information you
requested.

Benchmark Setup:
- Hardware: x86_64, Linux 5.10, NVMe SSD
- PostgreSQL: 19devel (with DWB patch applied)
- Tool: pgbench (TPC-B), 64 clients, 8 threads, 60 seconds per run
- Common config: shared_buffers = 1GB, wal_level = replica
- Three modes tested:
* io_torn_pages_protection = full_pages (traditional FPW)
* io_torn_pages_protection = double_writes (DWB size = 128MB)
* io_torn_pages_protection = off (no protection, baseline)

Each test was run sequentially on the same machine to avoid I/O
contention.
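
For anyone wanting to reproduce this, the per-mode settings reduce to a few
GUCs. This is a sketch based on the names used in this thread; the exact
names and defaults are whatever the patch ships:

```
# common to all three runs
shared_buffers = 1GB
wal_level = replica

# mode 1: traditional full-page writes
io_torn_pages_protection = full_pages

# mode 2: doublewrite buffer
io_torn_pages_protection = double_writes
double_write_buffer_size = 128MB

# mode 3: no torn-page protection (benchmark baseline only -- unsafe)
io_torn_pages_protection = off
```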

Test 1: pgbench scale=100 (~1.5GB dataset), max_wal_size = 10GB

With infrequent checkpoints, FPW overhead is minimal and all three
modes perform similarly:

Mode           TPS       Latency(avg)   WAL Size   FPI Count   FPI Size
-------------  --------  -------------  ---------  ----------  ---------
full_pages     103,290   0.610 ms       3,903 MB   191,341     1,456 MB
double_writes  104,088   0.606 ms       2,475 MB   0           0
off            104,622   0.602 ms       2,510 MB   0           0

DWB vs FPW: +0.8% TPS, WAL reduced by 36.6%.

Test 2: pgbench scale=100 (~1.5GB dataset), max_wal_size = 64MB

With frequent checkpoints (triggered every ~64MB of WAL), the FPW write
amplification becomes severe:

Mode           TPS       Latency(avg)   WAL Size   FPI Count   FPI Size
-------------  --------  -------------  ---------  ----------  ---------
full_pages     54,324    1.171 ms       29 GB      3,806,504   28 GB
double_writes  93,942    0.672 ms       2,303 MB   2           16 kB
off            108,901   0.578 ms       2,746 MB   0           0

DWB vs FPW: +72.9% TPS, latency reduced by 42.6%, WAL reduced by 92.2%.

Test 3: pgbench scale=10 (~150MB dataset), max_wal_size = 64MB

Even with a smaller dataset that fits in shared_buffers, the advantage
is clear:

Mode           TPS       Latency(avg)   WAL Size   FPI Count   FPI Size
-------------  --------  -------------  ---------  ----------  ---------
full_pages     33,707    1.895 ms       15 GB      2,010,439   14 GB
double_writes  43,743    1.459 ms       1,140 MB   0           0
off            43,982    1.452 ms       1,150 MB   0           0

DWB vs FPW: +29.8% TPS, WAL reduced by 92.6%.

Test 4: sysbench oltp_write_only (10 tables x 100K rows), 64 threads, 30s

               max_wal_size=10GB       max_wal_size=64MB
               ---------------------   ---------------------
Mode           TPS       FPI Count     TPS       FPI Count
-------------  --------  -----------   --------  -----------
full_pages     138,019   32,253        40,187    3,455,253
double_writes  136,642   0             133,034   0

With large max_wal_size: DWB and FPW perform comparably (-1.0%).
With small max_wal_size: DWB is +231% faster (3.3x).

Analysis:

The key factor is checkpoint frequency. FPW must write a full 8KB page
image to WAL for every page's first modification after each checkpoint.
When checkpoints are frequent:

- FPI count explodes (32K -> 3.8M with scale=100)
- WAL becomes dominated by FPI data (28GB out of 29GB = 96.6%)
- This creates massive write amplification on the WAL path
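
The "dominated by FPI" claim is easy to sanity-check from the raw counts
(Test 2 numbers; this ignores FPI hole/compression savings, so it slightly
overestimates):

```python
PAGE_SIZE = 8192

fpi_count = 3_806_504                     # full_pages mode, Test 2
fpi_gib = fpi_count * PAGE_SIZE / 2**30   # ~29 GiB of raw page images

# Nearly the entire 29 GB of WAL is full-page images, matching the reported
# 28 GB FPI size (holes in pages account for the small difference).
```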

DWB avoids this entirely. Its WAL size stays constant regardless of
checkpoint frequency (~2.3GB vs FPW's 29GB). The double-write buffer
itself wrote 42GB in the heavy test, but since it uses sequential writes
with batched fsync, the overhead is modest -- DWB achieves 86.3% of the
"no protection" baseline (93,942 vs 108,901 TPS).

When does DWB matter most?
- Large active datasets that exceed shared_buffers
- Frequent checkpoints (small max_wal_size or short checkpoint_timeout)
- Write-heavy workloads
- Replication scenarios where WAL volume directly impacts network

In production environments where max_wal_size is often set
conservatively (e.g., 1GB) and datasets are much larger than
shared_buffers, DWB should provide significant and consistent benefits
over FPW. As for the crash safety testing you mentioned, it is on our
roadmap as we continue to refine the patch for our internal RDS
deployment.

Regards,
Baotiao

#7wenhui qiu
qiuwenhuifx@gmail.com
In reply to: 陈宗志 (#6)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Hi
Why have you selectively ignored this parameter?
wal_compression = lz4 # enables compression of full-page writes;
# off, pglz, lz4, zstd, or on

Thanks

On Thu, Feb 26, 2026 at 3:14 PM 陈宗志 <baotiao@gmail.com> wrote:

Show quoted text


#8DEVOPS_WwIT
devops@ww-it.cn
In reply to: Robert Treat (#3)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Hi

Personally, I believe that doublewrite is very smart for MySQL InnoDB,
but not a good idea for Postgres; currently, WAL is the best solution
for Postgres.

maybe the next generation log system for Postgres could use OrioleDB's
storage engine.

Regards

Tony

Show quoted text

On 2026/2/19 02:00, Robert Treat wrote:


#9陈宗志
baotiao@gmail.com
In reply to: wenhui qiu (#7)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Hi wenhui,

Here are the latest benchmark results for the Double Write Buffer (DWB)
proposal. In this round of testing, I have included the two-phase
checkpoint batch fsync optimization and evaluated the impact of
wal_compression (lz4) on both FPW and DWB.

Test Environment:
- PostgreSQL: 19devel (with DWB patch applied)
- Hardware: Linux 5.10, x86_64
- Configuration:
* shared_buffers = 1GB
* max_wal_size = 32MB (to stress checkpoint frequency)
* wal_compression = lz4
* double_write_buffer_size = 128MB (for DWB mode)
- Workload: sysbench 1.1.0, 10 tables x 1,000,000 rows (~2.3GB dataset)
- Method: 16 threads, 60 seconds per run, each mode tested
independently (only one instance running at a time to eliminate
I/O contention).

Three modes compared:
- FPW: io_torn_pages_protection = full_pages (current default)
- DWB: io_torn_pages_protection = double_writes
- OFF: io_torn_pages_protection = off (no protection, baseline)

Results with wal_compression = lz4
----------------------------------
1. oltp_write_only (pure write transactions: UPDATE + DELETE + INSERT)

Mode   TPS       vs FPW    vs OFF
----   -------   -------   -------
FPW    13,772    -         -64.3%
DWB    20,660    +50.0%    -46.5%
OFF    38,588    +180.2%   -

2. oltp_update_non_index (single UPDATE per transaction)

Mode   TPS       vs FPW    vs OFF
----   -------   -------   -------
FPW    59,427    -         -57.5%
DWB    104,328   +75.6%    -25.4%
OFF    139,870   +135.4%   -

3. oltp_read_write (mixed: 70% reads + 30% writes)

Mode   TPS      vs FPW    vs OFF
----   ------   -------   -------
FPW    6,232    -         -9.0%
DWB    4,408    -29.3%    -35.6%
OFF    6,845    +9.8%     -

Results without wal_compression (for comparison)
------------------------------------------------
Workload                FPW      DWB      DWB vs FPW
--------                ------   ------   ----------
oltp_write_only         9,651    22,111   +129.1%
oltp_update_non_index   48,624   98,356   +102.3%
oltp_read_write         5,414    5,275    -2.6%

Key Observations:

1. Write-heavy workloads: DWB outperforms FPW by +50% to +76% even
with lz4 compression enabled. Without lz4, the advantage grows
to +102% to +129% because uncompressed full-page images cause
severe WAL bloat.

2. lz4 compression significantly helps FPW: For oltp_write_only, lz4
boosts FPW from 9,651 to 13,772 TPS (+43%), while DWB sees minimal
change (22,111 -> 20,660). This is expected -- lz4 compresses the
8KB full-page images that FPW writes to WAL, but DWB doesn't
generate FPIs at all, so lz4 has little effect on DWB's WAL volume.

3. Read-heavy mixed workloads: DWB shows a regression (-29%) in
oltp_read_write with lz4. This workload is 70% reads with only 4
write operations per transaction, so FPW overhead is minimal.
Meanwhile, DWB incurs additional I/O overhead from writing pages
to the double write buffer file, which outweighs the WAL savings
in this scenario.

4. Batch fsync optimization is critical for DWB: The two-phase
checkpoint approach (batch all DWB writes in Phase 1 -> single
fsync -> data file writes in Phase 2) reduces checkpoint DWB
fsyncs from millions to ~hundreds. For example, in
oltp_write_only: 1,157,729 DWB page writes -> only 148 fsyncs.
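
As a back-of-the-envelope check on that ratio (page and buffer sizes as used
in this thread; the patch also fsyncs at partial batches and checkpoint
boundaries, so the real count is somewhat higher than the floor):

```python
DWB_BYTES = 128 * 1024 * 1024
PAGE_SIZE = 8192
capacity = DWB_BYTES // PAGE_SIZE                  # 16,384 pages per buffer

page_writes = 1_157_729                            # oltp_write_only run
full_buffer_fsyncs = -(-page_writes // capacity)   # ceiling division

# Naive per-page fsyncing would need ~1.16M fsyncs; batching brings the
# floor down to ~71 buffer flushes, the same order of magnitude as the
# ~148 fsyncs actually observed.
```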

Summary:

DWB provides substantial performance benefits for write-intensive
workloads with frequent checkpoints, which is the scenario where FPW
overhead is most pronounced. The advantage is most significant without
WAL compression (+100~130%), and remains strong (+50~76%) even with
lz4 enabled. For read-dominated mixed workloads, DWB currently shows
overhead that needs further optimization (reducing non-checkpoint
DWB fsync costs).

Regards,
Baotiao

#10wenhui qiu
qiuwenhuifx@gmail.com
In reply to: 陈宗志 (#9)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Hi
This test is completely meaningless. Just as you wouldn't set
innodb_redo_log_capacity or innodb_max_dirty_pages_pct to their minimum
values, you used an extreme example to prove the double write. Why didn't
you compare using best practices?

Thanks

On Fri, Feb 27, 2026 at 7:43 PM 陈宗志 <baotiao@gmail.com> wrote:

Show quoted text


#11陈宗志
baotiao@gmail.com
In reply to: 陈宗志 (#9)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Hi Tony,

I personally believe that Double Write is very smart for MySQL
InnoDB, but not a good idea for Postgres; currently, WAL is the best
solution for Postgres. Maybe the next-generation log system for
Postgres could use OrioleDB's storage engine.

Just to clarify from a technical perspective, both MySQL and PostgreSQL
use Write-Ahead Logging (WAL) as their fundamental transaction logging
mechanism, so there is no difference in that regard.

The comparison here is specifically between Full-Page Writes (FPW) and
the Double Write Buffer (DWB). Neither of these concepts conflicts with
or replaces the core WAL design. Instead, both are simply different
techniques implemented to solve the exact same issue: preventing torn
pages during a crash.

My proposal is aimed at discussing the performance tradeoffs and
implementation details between these two specific torn-page protection
mechanisms, rather than replacing WAL itself.

Regards,
Baotiao

#12陈宗志
baotiao@gmail.com
In reply to: wenhui qiu (#10)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Hi,

This test is completely meaningless. Just as you wouldn't set
innodb_redo_log_capacity or innodb_max_dirty_pages_pct to their
minimum values, you shouldn't use an extreme configuration to make
the case for double write. Why didn't you compare using best
practices?

I wouldn't be so quick to dismiss these results. The configuration
was deliberately chosen to trigger more frequent checkpoints. As I
mentioned in my initial email, more frequent checkpoints strictly bound
the amount of WAL that needs to be replayed, resulting in much faster
crash recovery.

The entire ARIES paper heavily emphasizes optimizing crash recovery
time in database design. Minimizing recovery time is a fundamental
database capability, and we shouldn't rely solely on High Availability
(HA) switchovers to mask or solve crash recovery problems.

Actually, I have always felt that PostgreSQL's minimum limit of 30s
for `checkpoint_timeout` is a bit too restrictive. Ideally, the system
should allow for even higher frequency checkpoints. Setting it to a
lower value, such as 10s, could achieve the exact same effect of
strictly bounding recovery time. This test simulates an environment
where a very aggressive RTO (Recovery Time Objective) is required,
which is a highly practical scenario, not just an extreme edge case.
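The RTO argument can be made concrete with back-of-envelope arithmetic (all rates below are illustrative assumptions, not measurements): worst-case replay work is roughly the WAL produced since the last checkpoint, so shrinking the checkpoint interval shrinks the recovery bound proportionally.

```python
# Illustrative sketch: recovery time is bounded by roughly
# (WAL write rate x checkpoint interval) / replay rate.
# All numbers below are assumptions for the sake of the example.

def recovery_bound_s(wal_mb_per_s: float,
                     checkpoint_interval_s: float,
                     replay_mb_per_s: float) -> float:
    wal_to_replay_mb = wal_mb_per_s * checkpoint_interval_s
    return wal_to_replay_mb / replay_mb_per_s

# e.g. 50 MB/s of WAL with a 200 MB/s replay rate:
print(recovery_bound_s(50, 10, 200))    # 10s checkpoints  -> 2.5
print(recovery_bound_s(50, 300, 200))   # 5min checkpoints -> 75.0
```

Under these assumed rates, 10-second checkpoints bound worst-case recovery to a few seconds, while 5-minute checkpoints allow it to grow past a minute.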

Regards,
Baotiao

#13陈宗志
baotiao@gmail.com
In reply to: 陈宗志 (#1)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

Hi,

(if an instance can recover and start up within 15 seconds) This
depends on the volume of write operations and whether MySQL uses
incremental checkpoints. The experiences with the two databases
differ: PostgreSQL incorporates numerous optimizations to mitigate
FPW, such as enabling wal_compression as mentioned earlier, and
reducing WAL volume through vacuum.
Another thing: did RDS set innodb_redo_log_capacity to the minimum
value?

In our production RDS environment, we generally do not enable
wal_compression. Compressing data during write operations can
introduce CPU overhead and latency spikes (jitter). For our production
services, maintaining stable and predictable performance is much more
important than the I/O savings. Of course, as my previous benchmark
results demonstrated, I fully agree that wal_compression provides a
significant performance boost when FPW is enabled.

Regarding MySQL's behavior, it predominantly relies on incremental
checkpoints (also known as fuzzy checkpoints) during normal operation.
It only flushes all dirty pages to disk during a clean, safe shutdown.

As for whether we set innodb_redo_log_capacity to the minimum value
in production: no, we do not. As I mentioned in my previous email,
the aggressive configuration in my test was purely to simulate an
environment that triggers higher frequency checkpoints to strictly
bound the crash recovery time. In a real-world PostgreSQL deployment,
we could theoretically achieve a similar effect by lowering the
checkpoint_timeout, if the system allowed configuring it below the
current 30-second minimum.

Regards,
Baotiao


#14Robert Treat
xzilla@users.sourceforge.net
In reply to: 陈宗志 (#6)
Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write

On Thu, Feb 26, 2026 at 2:14 AM 陈宗志 <baotiao@gmail.com> wrote:

Hi Robert,

Thanks for the feedback and suggestions.

<snip>

So, I haven't looked at the code itself; tbh I am a bit too
paranoid to dive into generated code that would seem to carry some
likely level of legal risk around potential reuse of GPL/proprietary
code it might be based on (either in its original training, inference,
or context used for generation. Yeah, I know innodb isn't written in
C, but still). That said, I did have some feedback and questions on
the proposal itself, and some suggestions for how to move things
forward.

0. The convention here is send the patches using:
git format-patch -v<VERSION> HEAD~<numberOfpatches>
for easier review. The 0003 probably should be out of scope. Anyway I've
attached all of those so maybe somebody else is going to take a
look at them too,
they look very mature. Is this code used in production already anywhere? (and
BTW the numbers are quite impressive)

While Jakub is right that the convention is to send patches, that
convention is based on a manual development model, not an agentic
development model. While there is no official project policy on this,
IMHO the thing we really need from you is not the code output, but the
prompts that were used to generate the code. There are plenty of folks
who have access to claude that could then use those prompts to
"recreate with enough proximity" the work you had claude do, and that
process would also allow for additional verification and reduction of
any legal concerns or concerns about investing further human
time/energy. (No offense, but as you are not a regular contributor,
you could analogize this to when third parties do large code dumps and
say "here's a contribution, it's up to you to figure out how to use
it". Ideally we want other folks to be able to pick up the project and
continue with it, even if it means recreating it, and that works best
if we have the underlying prompts).
The claude code configuration file is a good start, but certainly not
enough. Probably the ideal here would be full session logs, although a
developer-diary would probably also suffice. I'm kind of guessing here
because I don't know the scope of the prompts involved or how you were
interacting with Claude in order to get where you are now, but those
seem like the more obvious tools for work of this size whose intention
is to be open.

Regarding the AI-generated code, the raw output from Claude was far
from perfect. I have manually reviewed and modified the code
extensively to get it to this state.

Our plan is to first deploy and test this thoroughly on our own
product, Alibaba Cloud RDS for PostgreSQL. Once we are confident that
it is stable and issue-free, we intend to submit a formalized patch
to the community. I am very much looking forward to discussing and
reviewing the actual code with you all when the time comes.

As for sharing the prompts or session logs, I personally feel they
might not be as valuable as the final code itself. The generation
process involved a lot of iterative, back-and-forth communication;
the AI only knew how to make the right modifications after continuous
human guidance, correction, and architectural decisions.

Yeah, this is an issue for which we don't seem to have very good
answers at the moment. Part of me thinks the right path is to
require completely open transcripts of this back and forth, like you
would see in a discussion of developers on the mailing list. OTOH,
lots of patches have gone through "pre-development" work before
hitting the mailing lists, not to mention that agents can operate in
ways that are so ridiculously verbose that the idea of having these
kinds of logs for every developer doesn't sound like it would scale,
if it even remained useful.

In any case, as you stated you were a former InnoDB developer, you
clearly understand concerns about potential IP muddiness, so to that
end I decided to spin up my own agent to examine your patch against
the InnoDB implementation and produce an analysis contrasting the
two. While no one should mistake that for anything official, the
initial read-through was comforting.

It would be helpful if you could provide a little more information on
the system you are running these benchmarks on, specifically for me
the underlying OS/Filesystem/hardware, and I'd even be interested in
the build flags. I'd also be interested to know if you did any kind of
crash safety testing... while it is great to have improved
performance, presumably that isn't actually the primary point of these
subsystems. It'd also be worth knowing if you tested this on any
systems with replication (physical or logical) since we'd need to
understand those potential downstream effects. I'm tempted to say you
should have an AI generate some pgbench scripts. Granted it's early,
and fine if you haven't done any of this, but I imagine we'll need to
look at it eventually.

I have addressed the feedback and conducted comprehensive benchmarks
comparing the three io_torn_pages_protection modes. Here are the
detailed performance results and the system setup information you
requested.

Benchmark Setup:
- Hardware: x86_64, Linux 5.10, NVMe SSD
- PostgreSQL: 19devel (with DWB patch applied)
- Tool: pgbench (TPC-B), 64 clients, 8 threads, 60 seconds per run
- Common config: shared_buffers = 1GB, wal_level = replica
- Three modes tested:
* io_torn_pages_protection = full_pages (traditional FPW)
* io_torn_pages_protection = double_writes (DWB size = 128MB)
* io_torn_pages_protection = off (no protection, baseline)

Each test was run sequentially on the same machine to avoid I/O
contention.

<snip>

Analysis:

The key factor is checkpoint frequency. FPW must write a full 8KB page
image to WAL for every page's first modification after each checkpoint.
When checkpoints are frequent:

<snip>

When does DWB matter most?
- Large active datasets that exceed shared_buffers
- Frequent checkpoints (small max_wal_size or short checkpoint_timeout)
- Write-heavy workloads
- Replication scenarios where WAL volume directly impacts network

In production environments where max_wal_size is often set
conservatively (e.g., 1GB) and datasets are much larger than
shared_buffers, DWB should provide significant and consistent benefits
over FPW. As for the crash safety testing you mentioned, it is on our
roadmap as we continue to refine the patch for our internal RDS
deployment.
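The recovery-side repair path that such crash safety testing would need to validate (per the original proposal: intact DWB copies repair torn data-file pages) can be sketched as follows. This is a simplified editorial model, not the patch's code: a 4-byte CRC header stands in for PostgreSQL's real page checksum, and in-memory dicts stand in for the data files and the DWB area.

```python
import zlib

PAGE_SIZE = 8192

def make_page(fill: int) -> bytes:
    """Build a page: 4-byte CRC header followed by the body."""
    body = bytes([fill]) * (PAGE_SIZE - 4)
    return zlib.crc32(body).to_bytes(4, "little") + body

def is_torn(page: bytes) -> bool:
    """A page is torn if its stored CRC no longer matches its body."""
    crc = int.from_bytes(page[:4], "little")
    return zlib.crc32(page[4:]) != crc

def recover(data_pages, dwb_pages):
    """Keep each on-disk copy if intact; otherwise restore the DWB copy."""
    repaired = 0
    for pageno, dwb_copy in dwb_pages.items():
        if is_torn(data_pages[pageno]) and not is_torn(dwb_copy):
            data_pages[pageno] = dwb_copy
            repaired += 1
    return repaired

data = {0: make_page(1), 1: make_page(2)}
dwb = {0: data[0], 1: data[1]}
# Simulate a torn write: only the first half of page 1 hit the disk.
data[1] = data[1][:PAGE_SIZE // 2] + b"\x00" * (PAGE_SIZE // 2)
print(recover(data, dwb))   # -> 1 page repaired
print(is_torn(data[1]))     # -> False after repair
```

The design's premise is that because the DWB is fsync'd before the scatter writes begin, at most one of the two copies of any given page can be torn after a crash, so this checksum-and-restore pass always finds an intact copy.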

I suspect that some folks would argue that the problem is as much
users with poorly configured servers (primarily undersized
max_wal_size and too frequent checkpointing) as it is them needing an
entirely different page write implementation, but there are certainly
some workloads this helps even when those things are tuned
accordingly. Makes me wonder, it's a bit of a crazy idea, but have you
thought about the possibility of making this user-settable per
transaction... recreating magic similar to synchronous_commit?

Robert Treat
https://xzilla.net