[WIP] Pipelined Recovery
Hi,
Based on a suggestion by my colleague Ants Aasma, I worked on this
idea of adding parallelism to the WAL recovery process.
The crux of this idea is to decode the WAL using parallel workers, so that
the replay process can get already-decoded records directly from a shared
memory queue. This takes some CPU load off the recovery process.
Implementing this idea yielded an improvement of around 20% in the
recovery times, but results may differ based on workloads. I have
attached some benchmarks for different workloads.
Following are some recovery tests with the default configs. Here p0 is
without the pipeline and p1 is with the pipeline enabled; "db size" is the
size of the backup database on which the recovery happens. You can find more
detail on the benchmarks in the attached file `recoveries-benchmark-v01`.
                 elapsed (p0)   elapsed (p1)   % perf   db size
inserts.sql      272s 10ms      197s 570ms     27.37%   480 MB
updates.sql      177s 420ms     117s 80ms      34.01%   480 MB
hot-updates.sql  36s 940ms      29s 240ms      20.84%   480 MB
nonhot.sql       36s 570ms      28s 980ms      20.75%   480 MB
simple-update    20s 160ms      11s 580ms      42.56%   4913 MB
tpcb-like        20s 590ms      13s 640ms      33.75%   4913 MB
A similar approach was also suggested by Matthias van de Meent earlier in a
separate thread [1]. Right now I am using one bgworker for decoding and
filling up the shared message queue, and the redo apply loop simply receives
the decoded records from the queue. After the redo is finished, the consumer
(startup process) can request a shutdown from the producer (pipeline
bgworker) before exiting recovery.
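To make the flow concrete, here is a minimal sketch of the consumer side on
top of the existing shm_mq API (handle naming is illustrative; queue setup,
error handling, and the pointer fix-ups inside the decoded record are
omitted):

```c
#include "postgres.h"
#include "access/xlogreader.h"	/* DecodedXLogRecord */
#include "storage/shm_mq.h"

/*
 * Consumer (startup process): block until the producer publishes the next
 * decoded record, or detect that the producer went away.
 */
static DecodedXLogRecord *
pipeline_next_record(shm_mq_handle *consumer_mq_handle)
{
	Size		nbytes;
	void	   *data;
	shm_mq_result res;

	res = shm_mq_receive(consumer_mq_handle, &nbytes, &data, false);
	if (res == SHM_MQ_SUCCESS)
		return (DecodedXLogRecord *) data;	/* valid until the next receive */

	/* SHM_MQ_DETACHED: producer exited; caller stops the pipeline */
	return NULL;
}
```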
This idea can be coupled with another one: pinning the buffers in parallel
before the recovery process needs them. That would parallelize most of the
work being done in `XLogReadBufferForRedoExtended`; redo could simply receive
the already-pinned buffers from a queue. Implementing this still needs some
R&D, as IPC and pinning/unpinning of buffers across two processes can be
tricky.
If someone wants to reproduce the benchmarks, they can do so using these
scripts [2].
Looking forward to your reviews, comments, etc.
[1]: /messages/by-id/CAEze2Wh6C_QfxLii++eZue5=KvbVXKkHyZW8PLmtLgyjmFzwCQ@mail.gmail.com
[2]: https://github.com/imranzaheer612/pg-recovery-testing
--
Regards,
Imran Zaheer
CYBERTEC PostgreSQL International GmbH
Attachments:
recoveries-benchmarks-v01.zip (application/zip)
v1-0001-Pipelined-Recoveries.patch (application/octet-stream)
recoveries-becnhmark-v01.pdf (application/pdf)
Hi
Just found this discussion where Bruce Momjian mentioned replication
pipelining.
[1]: /messages/by-id/aJyuxlqx0-OSuGqC@momjian.us
Thanks
Imran Zaheer
Hi Imran,
It's great that you are undertaking such a cool project (I also think
that recovery
performance is one of the two biggest performance limitations today).
I've taken a
quick look at the attached benchmarks and to my surprise they have flamegraphs!
Thoughts on results:
- on the flamegraphs BufferAlloc->GetVictimBuffer->FlushBuffer() is visible
often in both scenarios p0/p1. I recall from my old WAL IO prefetching stress
test experiments [1] that tuning bgwriter played some important role
(most of the time bgwriter values are left at the defaults). It's connected
to the pipelining idea too for sure, and somehow we are not effective at it
based on your results, mine, and some other field reports. Basically we
have wrong defaults for bgwriter on standbys, and no one is going to notice
unless you measure it in depth. We could also tweak the bgwriter in standby
role to make the pipelining more effective that way too, but maybe I am
missing something? Maybe bgwriter on standby should work full steam instead
of being rate limited, as long as there is work to be done (pressure)? It's
not a critique of the patch, but maybe you guys have investigated that road
too, as it is related to the pipelining concept?
- there's plenty of ValidXLogRecord->pg_comp_crc32c_sb8(), probably even too
much, so I think either an old CPU was used or something happened so that
SSE4.2 was not available (pg_comp_crc32c_sse42()). I'm speculating, but you
would probably get even better results from the patch (which are impressive
anyway) by using pg_comp_crc32c_sse42(), because it wouldn't be such a
bottleneck anymore.
- The flamegraphs themselves are for the whole PostgreSQL cluster, right, or
am I misunderstanding? In the long run it would probably be better to profile
just the PID of the startup/recovering process (but that's way harder to
script for sure, due to the need for an isolated PID: perf record -p <PID>)
- You need to rebase due to 1eb09ed63a8 from a couple of days ago:
CONFLICT (content): Merge conflict in src/backend/postmaster/bgworker.c
- naming: I don't have anything against wal_pipeline.c, it's just that
the suite of C files is named src/backend/access/transam/xlog*.c
(so perhaps xlogrecoverypipeline.c (?) to stay consistent)
- I was getting lots of problems during linking (e.g. undefined reference to
`WalPipeline_IsActive') and it appears that the patch is missing
meson support:
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -25,6 +25,7 @@ backend_sources += files(
'xlogstats.c',
'xlogutils.c',
'xlogwait.c',
+'wal_pipeline.c'
- nitpicking, but you need to add wal_pipeline to postgresql.conf.sample,
otherwise the 003_check_guc.pl test fails.
- good news is that test with PG_TEST_EXTRA="wal_consistency_checking" passes
- but bad news is that PG_TEST_EXTRA="wal_consistency_checking" with
PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on", which activates all of this,
fails multiple tests (possibly I'll look into those in the following days,
unless you are faster)
- at least here, fresh after pg_basebackup -R for building a standby and a
restart with wal_pipeline=on, I've got a bug:
2026-02-04 10:29:40.146 CET [335197] LOG: [walpipeline] started.
2026-02-04 10:29:40.146 CET [335197] LOG: redo starts at 0/02000028
2026-02-04 10:29:40.147 CET [335198] LOG: invalid record length
at 0/03000060: expected at least 24, got 0
2026-02-04 10:29:40.147 CET [335197] LOG: consistent recovery
state reached at 0/03000060
2026-02-04 10:29:40.147 CET [335197] LOG: [walpipeline] consumer:
received shutdown message from the producer
2026-02-04 10:29:40.147 CET [335191] LOG: database system is
ready to accept read-only connections
2026-02-04 10:29:40.157 CET [335198] LOG: [walpipeline] producer:
exiting: sent=5 received=5
2026-02-04 10:29:40.167 CET [335197] LOG: WAL pipeline stopped
2026-02-04 10:29:40.167 CET [335197] LOG: redo done at 0/03000028
system usage: CPU: user: 0.00 s, system: 0.03 s, elapsed: 0.04 s
2026-02-04 10:29:40.169 CET [335197] LOG: selected new timeline ID: 2
2026-02-04 10:29:40.184 CET [335197] LOG: archive recovery complete
- even when starting from scratch with wal_pipeline = on it's similar:
2026-02-04 10:34:14.386 CET [335833] LOG: database system was
interrupted; last known up at 2026-02-04 10:34:02 CET
2026-02-04 10:34:14.453 CET [335833] LOG: starting backup recovery
with redo LSN 0/0B0000E0, checkpoint LSN 0/0B000138, on timeline ID 1
2026-02-04 10:34:14.453 CET [335833] LOG: entering standby mode
2026-02-04 10:34:14.482 CET [335833] LOG: [walpipeline] started.
2026-02-04 10:34:14.482 CET [335833] LOG: redo starts at 0/0B0000E0
2026-02-04 10:34:14.484 CET [335833] LOG: completed backup
recovery with redo LSN 0/0B0000E0 and end LSN 0/0B0001E0
2026-02-04 10:34:14.484 CET [335833] LOG: consistent recovery
state reached at 0/0B0001E0
2026-02-04 10:34:14.484 CET [335833] LOG: [walpipeline] consumer:
received shutdown message from the producer
2026-02-04 10:34:14.484 CET [335827] LOG: database system is ready
to accept read-only connections
2026-02-04 10:34:14.493 CET [335834] LOG: [walpipeline] producer:
exiting: sent=4 received=4
2026-02-04 10:34:14.504 CET [335833] LOG: WAL pipeline stopped
2026-02-04 10:34:14.504 CET [335833] LOG: redo done at 0/0B0001E0
system usage: CPU: user: 0.00 s, system: 0.03 s, elapsed: 0.04 s
2026-02-04 10:34:14.506 CET [335833] LOG: selected new timeline ID: 2
it appears that the "Cannot read more records, shut it down" ->
WalPipeline_SendShutdown() path is taken, but I haven't pursued it further.
Also some very quick review comments:
+bool
+WalPipeline_SendRecord(XLogReaderState *record)
+{
[..]
+
+ if (msglen > WAL_PIPELINE_MAX_MSG_SIZE)
+ {
+ elog(WARNING, "[walpipeline] producer: wal record at %X/%X
too large (%zu bytes), skipping",
+ LSN_FORMAT_ARGS(record->ReadRecPtr), msglen);
+ pfree(buffer);
+ return true;
+ }
When/why could this happen, and if it did happen, shouldn't this be more
like a PANIC instead?
+/* Size of the shared memory queue (can be made configurable) */
+#define WAL_PIPELINE_QUEUE_SIZE (128 * 1024 * 1024) /* 8 MB */
+
+/* Maximum size of a single message */
+#define WAL_PIPELINE_MAX_MSG_SIZE (2 * 1024 * 1024) /* 1 MB */
Maybe those should take into account (XLOG_)BLCKSZ too?
-J.
[1]: /messages/by-id/VI1PR0701MB69608CBCE44D80857E59572EF6CA0@VI1PR0701MB6960.eurprd07.prod.outlook.com
Dear Imran,
I feel your idea is very interesting. Here are very elementary comments from my
point of view.
01.
I feel the patch looks large and it is a bit difficult to review. Can you
separate it into several parts, e.g., producer and consumer parts?
02.
Per the other files, the file name should seemingly be "xlogpipeline.{h|c}".
03.
Can you add tests? It's also helpful for understanding the patch.
04.
Some Assert()s are commented out, but that won't be accepted. Can you add a
function like AmWalPipelineProducer() and call it there?
05.
The producer is implemented as a bgworker process, but another option is to
implement it as a dedicated background (auxiliary) process. Which one is
better, and why?
06.
Can we enable the pipeline with streaming replication? I've tried, but it
did not work well - the producer exits and recovery finishes without any
promote signal. My expectation was that the producer would wait until a new
record arrived.
```
...
LOG: starting backup recovery with redo LSN 0/02000028, checkpoint LSN 0/02000080, on timeline ID 1
LOG: entering standby mode
LOG: [walpipeline] started.
LOG: redo starts at 0/02000028
LOG: completed backup recovery with redo LSN 0/02000028 and end LSN 0/02000128
LOG: consistent recovery state reached at 0/02000128
LOG: database system is ready to accept read-only connections
LOG: [walpipeline] consumer: received shutdown message from the producer
LOG: [walpipeline] producer: exiting: sent=4 received=4
LOG: WAL pipeline stopped
LOG: redo done at 0/02000128 system usage: CPU: user: 0.00 s, system: 0.02 s, elapsed: 0.03 s
...
```
07.
As mentioned by Jakub, it is better to profile per process (startup and producer).
08.
I suggest benchmarking in a larger environment. Do let me know if that's
difficult - I can try on my env instead.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi Jakub, thanks for your valuable feedback.
- on the flamegraphs BufferAlloc->GetVictimBuffer->FlushBuffer() is visible
often in both scenarios p0/p1. [...] maybe you guys have investigated that
road too, as it is related to the pipelining concept?
Ok thanks, I will look into that.
- there's plenty of ValidXLogRecord->pg_comp_crc32c_sb8(), probably even too
much, so I think either an old CPU was used or something happened so that
SSE4.2 was not available (pg_comp_crc32c_sse42()). [...]
As mentioned in the benchmarks PDF, the CPU used was `Common KVM (4) @
2.593GHz`, and I was using a virtual machine. Maybe that could be the reason
for this bottleneck. I will do some more research on it.
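For reference, the validation that shows up here is roughly the following (a
sketch from memory of ValidXLogRecord() in xlogreader.c, so treat it as a
sketch): COMP_CRC32C resolves to pg_comp_crc32c_sse42 only when SSE4.2 is
detected at startup, otherwise the slicing-by-8 fallback dominates profiles.

```c
#include "postgres.h"
#include "access/xlogrecord.h"
#include "port/pg_crc32c.h"

static bool
record_crc_ok(XLogRecord *record)
{
	pg_crc32c	crc;

	/* CRC covers the payload first, then the header up to xl_crc */
	INIT_CRC32C(crc);
	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord,
				record->xl_tot_len - SizeOfXLogRecord);
	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
	FIN_CRC32C(crc);

	return EQ_CRC32C(record->xl_crc, crc);
}
```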
- The flamegraphs themselves are for the whole PostgreSQL cluster, right, or
am I misunderstanding? In the long run it would probably be better to profile
just the PID of the startup/recovering process (but that's way harder to
script for sure, due to the need for an isolated PID: perf record -p <PID>)
Yes, the flame graphs were for the whole PostgreSQL cluster. I will try to
improve the script so it profiles only the recovering processes.
- You need to rebase due to 1eb09ed63a8 from a couple of days ago:
CONFLICT (content): Merge conflict in src/backend/postmaster/bgworker.c
Noted, thanks.
- naming: I don't have anything against wal_pipeline.c, it's just that
the suite of C files is named src/backend/access/transam/xlog*.c
(so perhaps xlogrecoverypipeline.c (?) to stay consistent)
I agree, I will change that in the next patch. Maybe `xlogpipeline.c`
would be enough.
- I was getting lots of problems during linking (e.g. undefined reference to
`WalPipeline_IsActive') and it appears that the patch is missing
meson support:
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -25,6 +25,7 @@ backend_sources += files(
'xlogstats.c',
'xlogutils.c',
'xlogwait.c',
+'wal_pipeline.c'
Noted, thanks. This was an initial effort, and I still haven't tested in a
Windows environment.
- nitpicking, but you need to add wal_pipeline to postgresql.conf.sample
otherwise 003_check_guc.pl test fails.
Got it.
- good news is that the test with PG_TEST_EXTRA="wal_consistency_checking"
passes
- but bad news is that PG_TEST_EXTRA="wal_consistency_checking" with
PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on", which activates all of this,
fails multiple tests (possibly I'll look into those in the following days,
unless you are faster)
- at least here, fresh after pg_basebackup -R for building a standby and a
restart with wal_pipeline=on, I've got a bug:
[quoted log output trimmed; see above]
- even when starting from scratch with wal_pipeline = on it's similar:
[quoted log output trimmed; see above]
it appears that the "Cannot read more records, shut it down" ->
WalPipeline_SendShutdown() path is taken, but I haven't pursued it further.
Yes, initial testing was only done with archive recovery. I am still trying
to add support for streaming replication.
Also, many tests from `make -C src/test/recovery/ check
PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on"` are failing and need to be
fixed.
Also some very quick review comments:
+bool
+WalPipeline_SendRecord(XLogReaderState *record)
+{
[..]
+
+ if (msglen > WAL_PIPELINE_MAX_MSG_SIZE)
+ {
+ elog(WARNING, "[walpipeline] producer: wal record at %X/%X too large (%zu bytes), skipping",
+ LSN_FORMAT_ARGS(record->ReadRecPtr), msglen);
+ pfree(buffer);
+ return true;
+ }
When/why could this happen, and if it did happen, shouldn't this be more
like a PANIC instead?
Noted, will fix that.
+/* Size of the shared memory queue (can be made configurable) */
+#define WAL_PIPELINE_QUEUE_SIZE (128 * 1024 * 1024) /* 8 MB */
+
+/* Maximum size of a single message */
+#define WAL_PIPELINE_MAX_MSG_SIZE (2 * 1024 * 1024) /* 1 MB */
Maybe those should take into account (XLOG_)BLCKSZ too?
Yes, that's a good idea, I will do that.
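For example, the current values could be re-derived from the WAL block size
instead of magic numbers (a hypothetical sketch; whether the per-message cap
is even safe is a separate question, since oversized records are currently
skipped):

```c
/* same sizes as today, expressed via XLOG_BLCKSZ (8 kB by default) */
#define WAL_PIPELINE_QUEUE_SIZE   ((Size) 16384 * XLOG_BLCKSZ)	/* 128 MB */
#define WAL_PIPELINE_MAX_MSG_SIZE ((Size) 256 * XLOG_BLCKSZ)	/* 2 MB */
```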
Thanks for the valuable feedback. I will try to fix most of these
issues in the next patch.
Regards,
Imran Zaheer
Hi Kuroda.
Thanks for the feedback.
01.
I feel the patch looks large and it is a bit difficult to review. Can you
separate it into several parts, e.g., producer and consumer parts?
Ok I will try to split it.
02.
The file name seems to be "xlogpipeline.{h|c}" per others.
Noted, thanks.
03.
Can you add tests? it's also helpful to understand the patch.
Right now I am trying to fix the existing tests under
`src/test/recovery/`. But sure, we should add some pipeline-specific tests.
04.
Some Assert()s are commented out, but that won't be accepted. Can you add a
function like AmWalPipelineProducer() and call it there?
Noted. Sure, I will add AmWalPipelineProducer().
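Something along these lines, mirroring the existing Am*Process() helpers
(the flag name is hypothetical and would be set in the producer's bgworker
main function):

```c
/* hypothetical flag, set once in the producer's main entry point */
extern PGDLLIMPORT bool am_wal_pipeline_producer;

#define AmWalPipelineProducer() (am_wal_pipeline_producer)

/* producer-only code paths can then assert instead of commenting out */
Assert(AmWalPipelineProducer());
```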
05.
The producer is implemented as a bgworker process, but another option is to
implement it as a dedicated background (auxiliary) process. Which one is
better, and why?
Yes, I don't have a good answer for this question myself. For me the
bgworker API looked simple and good enough to start the pipeline
implementation.
06.
Can we enable the pipeline with streaming replication? I've tried, but it
did not work well - the producer exits and recovery finishes without any
promote signal. My expectation was that the producer would wait until a new
record arrived.
```
[quoted log output trimmed; see above]
```
Yes, I am working on fixing streaming replication; so far the
testing/benchmarking was done with archive recovery.
07.
As mentioned by Jakub, it is better to profile per process (startup and producer).
Noted.
08.
I suggest benchmarking in a larger environment. Do let me know if that's
difficult - I can try on my env instead.
Yes, that would be very helpful. I tried very hard myself to collect
some good benchmarks but have some hardware limitations. If you can
help me with this, that would be great. You can use the scripts [1]
that I used. Let me know if you need any help understanding or using
the scripts.
[1]: https://github.com/imranzaheer612/pg-recovery-testing
Thanks for your valuable feedback. That was very helpful.
Regards,
Imran Zaheer
Hello!
+
+ SpinLockAcquire(&WalPipelineShm->mutex);
+
+ if (WalPipelineShm->initialized)
+ {
+ SpinLockRelease(&WalPipelineShm->mutex);
+ return; /* Already started */
+ }
+
This doesn't seem to be a good use for a spinlock, as it guards a
longer operation. Spinlocks are supposed to guard "a few
instructions", not long initialization processes, according to their
documentation. Since the code already uses a dsm segment, wouldn't it be
easier to use something like GetNamedDSMSegment, which explicitly
supports this use case with an initialization callback?
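For what it's worth, a minimal sketch using the DSM registry (available
since PostgreSQL 17); WalPipelineShmCtl and WalPipelineShm are the patch's
names, and the callback is run exactly once, serialized by the registry:

```c
#include "storage/dsm_registry.h"
#include "storage/spin.h"

/* runs once under the registry's lock; no manual 'initialized' flag */
static void
walpipeline_shmem_init(void *ptr)
{
	WalPipelineShmCtl *ctl = (WalPipelineShmCtl *) ptr;

	SpinLockInit(&ctl->mutex);
	ctl->records_sent = 0;
	ctl->records_received = 0;
}

static void
walpipeline_attach_shmem(void)
{
	bool		found;

	WalPipelineShm = GetNamedDSMSegment("wal_pipeline",
										sizeof(WalPipelineShmCtl),
										walpipeline_shmem_init,
										&found);
}
```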
Also see the next two more specific comments about errors and spinlocks.
+ case WAL_MSG_ERROR:
+ SpinLockAcquire(&WalPipelineShm->mutex);
+ ereport(ERROR,
+ (errcode(WalPipelineShm->error_code),
+ errmsg("[walpipeline] consumer: received error from the producer: %s",
+ WalPipelineShm->error_message)));
+ SpinLockRelease(&WalPipelineShm->mutex);
+ return NULL;
According to the documentation spinlocks are not automatically
released on errors, and ereport ERROR stops the code flow so
everything after that is dead code.
+ SpinLockAcquire(&WalPipelineShm->mutex);
+ elog(LOG, "[walpipeline] producer: exiting: sent=" UINT64_FORMAT "
received=" UINT64_FORMAT,
+ WalPipelineShm->records_sent, WalPipelineShm->records_received);
+ SpinLockRelease(&WalPipelineShm->mutex);
A LOG is not an error, but elog can call palloc, which can cause an
out of memory error, and then again we never release the spinlock.
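The usual pattern for both spots is to copy the shared state out while
holding the spinlock and only call elog/ereport after releasing it, e.g. for
the exit message:

```c
uint64		sent;
uint64		received;

SpinLockAcquire(&WalPipelineShm->mutex);
sent = WalPipelineShm->records_sent;
received = WalPipelineShm->records_received;
SpinLockRelease(&WalPipelineShm->mutex);

/* no allocation (or error) can now happen while the lock is held */
elog(LOG, "[walpipeline] producer: exiting: sent=" UINT64_FORMAT
	 " received=" UINT64_FORMAT, sent, received);
```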
+ if (msglen > WAL_PIPELINE_MAX_MSG_SIZE)
+ {
+ elog(WARNING, "[walpipeline] producer: wal record at %X/%X too large
(%zu bytes), skipping",
+ LSN_FORMAT_ARGS(record->ReadRecPtr), msglen);
+ pfree(buffer);
+ return true;
+ }
This doesn't seem like a good idea to me, won't skipping records cause
data corruption?
+ shm_mq_handle *producer_mq_handle;
+ shm_mq_handle *consumer_mq_handle;
Aren't these handles process local, yet stored in WalPipelineShmCtl?
+{ name => 'wal_pipeline', type => 'bool', context => 'PGC_SIGHUP',
group => 'WAL_RECOVERY',
+ short_desc => 'Use parallel workers to speedup recovery.',
+ variable => 'wal_pipeline_enabled',
+ boot_val => 'false',
+},
Is SIGHUP really useful for this feature? It only runs at startup.
+ elog(FATAL, "[walpipeline] consumer: either pipeline not active, or
no record available from pipeline.");
+ return record;
FATAL also stops the codeflow, so that return is never executed.
+/* Size of the shared memory queue (can be made configurable) */
+#define WAL_PIPELINE_QUEUE_SIZE (128 * 1024 * 1024) /* 8 MB */
+
+/* Maximum size of a single message */
+#define WAL_PIPELINE_MAX_MSG_SIZE (2 * 1024 * 1024) /* 1 MB */
The comments about the sizes seem to be off.
if (reachedRecoveryTarget)
{
+ if (wal_pipeline_enabled)
+ WalPipeline_Stop();
What if we didn't reach the recovery target, shouldn't we stop the
pipelines then?
+ /* Send shutdown message if queue is available */
+ if (consumer_mq_handle)
+ WalPipeline_SendShutdown();
+}
This seems wrong, WalPipeline_SendShutdown checks for the producer
handle inside it instead? What's the exact contract, who should call
these methods? By looking at the code I'm not sure if this shutdown
logic works as intended.
Hi Zsolt.
Thanks a lot for the review and for pointing out the bugs. I have fixed the
bugs you mentioned in my new patch set, but the patch-set mail is being held
for moderation for some reason.
if (reachedRecoveryTarget)
{
+ if (wal_pipeline_enabled)
+ WalPipeline_Stop();
What if we didn't reach the recovery target, shouldn't we stop the
pipelines then?
I have fixed the shutdown logic bugs.
As we already know, we will exit the recovery redo loop in
`PerformWalRecovery()` only in two cases:
1: Recovery target reached:
In this case the consumer will call WalPipeline_Stop().
@@ -1807,6 +1931,9 @@ PerformWalRecovery(void)
if (reachedRecoveryTarget)
{
+ if (wal_pipeline_enabled)
+ WalPipeline_Stop();
+
2: The available pg_wal is consumed and there is no more WAL to read.
In this case the pipeline producer will send the shutdown msg to the
consumer. The consumer will detect this message and then call
`WalPipeline_Stop`. This is the case where we cannot read more records and
the while loop will break here.
+ if (decoded_record)
+ {
+ record = &decoded_record->header;
+ return record;
+ }
+ else
+ {
+ /*
+ * We will end up here only when pipeline couldn't read more
+ * records and have sent a shutdown msg. We will acknowldge this
+ * and will trigger request to stop the pipeline workers.
+ */
+ WalPipeline_Stop();
+ return NULL;
+ }
Hope this makes sense.
Once again, thanks for reporting the bugs. You will receive the new
patch-set mail soon, once it clears moderation.
Looking forward to your reviews, comments, etc.
Regards,
Imran Zaheer
(resending the mail, previous mail was held for moderation for some reason.)
Hi
I am attaching a new rebased version of the patch. Following are some
major changes in the new patch set.
* Streaming replication is now working. The prefetcher was not fully
decoupled from the startup process; that's why there were inconsistencies
in some scenarios and most of the recovery tap tests were failing.
* The patch is now split into consumer and producer patches. This will make
review easier.
* The pipeline shutdown flow is also improved. Now the producer will always
check for the shutdown flag (set by the consumer).
* The pipeline msg queue size is now configurable: `wal_pipeline_queue_size`.
* Tap tests now pass with PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on".
* New tap test `recovery/t/053_walpipeline.pl`. This covers some basic
functionality of the pipeline.
* The filename is changed to xlogpipeline.{h|c}.
Thanks to all for the valuable feedback.
Looking forward to your reviews, comments, etc.
--
Regards,
Imran Zaheer
Attachments:
v2-0001-Pipelined-Recovery-Producer.patch (text/x-patch)
v2-0002-Pipelined-Recovery-Consumer.patch (text/x-patch)
recoveries-becnhmark-v02.pdf (application/pdf)
recoveries-benchmark-v02.tar.xz (application/x-xz)
Hi Imran,
Thanks for this patch—it’s quite interesting. To my knowledge, there
have been prior attempts to introduce parallelism into recovery, as
you mentioned in your earlier email.
I’m curious how this approach differs from those previous efforts, and
why those attempts ultimately did not land. I imagine there were
constraints or complexities involved. It would be valuable to
understand what lessons can be drawn from them.
It also raises an implicit question: what makes the current approach
more promising—whether due to a simpler design or improved
performance.
While these may not be directly related to your current proposal, the
insights and experience from earlier work could help guide the
development and shape the direction of this patch. Of course, some of
this context can be pieced together from mailing list discussions and
past talks, but doing so raises the bar for future reviewers. Any
additional background you can share would be very helpful.
--
Best,
Xuneng
Dear Imran,
v2 can no longer be applied. And even when I resolve the conflict, it has
compiler warnings. Can you fix them?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi Xuneng, Imran, and everyone,
I’m curious how this approach differs from those previous efforts, and
why those attempts ultimately did not land.
There is directly relevant prior art that may be worth looking at.
Koichi Suzuki presented parallel recovery at PGCon 2023 [1] and
published a detailed design on the PostgreSQL wiki [2], with a working
prototype on GitHub.
Koichi's approach is quite different from the current patch: instead of
pipelining decode, he parallelizes redo itself by dispatching WAL
records to block workers based on page identity. The key rule is that
for a given page, WAL records are applied in written order, but
different pages can be replayed in parallel by different workers.
His design uses a dispatcher to route records to workers, with
synchronization needed for multi-block WAL records. One thing I
wondered is whether the dispatcher could be avoided entirely: if each
child simply reads the whole WAL stream on its own and skips blocks
that are not assigned to it, there would be no IPC and no need to
coordinate multi-block records across workers.
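As a sketch of that dispatcher-less variant: if the ownership rule is a pure
function of the block identity, every worker can compute it independently
while scanning the same WAL stream (BufferTag and hash_bytes are existing
PostgreSQL primitives; the rule itself is hypothetical):

```c
#include "postgres.h"
#include "common/hashfn.h"
#include "storage/buf_internals.h"	/* BufferTag */

/*
 * Pure function of block identity: every worker computes the same answer
 * for every block, with no IPC and no dispatcher.
 */
static bool
block_is_mine(const BufferTag *tag, int nworkers, int my_worker_id)
{
	uint32		h = hash_bytes((const unsigned char *) tag, sizeof(BufferTag));

	return (int) (h % nworkers) == my_worker_id;
}
```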
The hard problem he ran into was Hot Standby visibility: when index and
heap pages are replayed by different workers at different speeds,
concurrent queries can see inconsistent state. The wiki itself notes
the idea is to "use this when hot standby is disabled." As far as I
know, this was never submitted as a patch to hackers.
It also raises an implicit question: what makes the current approach
more promising—whether due to a simpler design or improved
performance.
The two approaches target different bottlenecks. The current patch
parallelizes WAL decoding, which keeps the redo path single-threaded
and avoids the Hot Standby visibility problem entirely.
One thing I am curious about in the current patch: WAL records are
already in a serialized format on disk. The producer decodes them and
then re-serializes into a different custom format for shm_mq. What is
the advantage of this second serialization format over simply passing
the raw WAL bytes after CRC validation and letting the consumer decode
directly? Offloading CRC to a separate core could still improve
throughput at the cost of higher total CPU usage, without needing the
custom format.
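A rough sketch of that alternative with the existing xlogreader entry points
(queue handle names follow the patch; how the record's LSN travels with the
bytes, and 'reader'/'lsn' on the consumer side, are glossed over here):

```c
/* producer: after CRC validation, forward the raw on-disk record bytes */
shm_mq_send(producer_mq_handle, record->xl_tot_len, record,
			false /* blocking */ , true /* force_flush */ );

/* consumer: decode locally, no custom serialization format needed */
Size		nbytes;
void	   *data;

if (shm_mq_receive(consumer_mq_handle, &nbytes, &data, false) == SHM_MQ_SUCCESS)
{
	XLogRecord *raw = (XLogRecord *) data;
	DecodedXLogRecord *decoded =
		palloc(DecodeXLogRecordRequiredSpace(raw->xl_tot_len));
	char	   *errormsg;

	if (!DecodeXLogRecord(reader, decoded, raw, lsn, &errormsg))
		elog(ERROR, "could not decode WAL record: %s", errormsg);
}
```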
Koichi's approach parallelizes redo (buffer I/O) itself, which attacks
a larger cost — Jakub's flamegraphs show BufferAlloc ->
GetVictimBuffer -> FlushBuffer dominating in both p0 and p1 — but at
the expense of much harder concurrency problems.
Whether the decode pipelining ceiling is high enough, or whether the
redo parallelization complexity is tractable, seems like the central
design question for this area.
[1]: https://www.pgcon.org/2023/schedule/session/392-parallel-recovery-in-postgresql/
[2]: https://wiki.postgresql.org/wiki/Parallel_Recovery
Best regards,
Henson
Hi Henson and Xuneng.
Thanks for explaining the approaches to Xuneng.
The two approaches target different bottlenecks. The current patch
parallelizes WAL decoding, which keeps the redo path single-threaded
and avoids the Hot Standby visibility problem entirely.
You are right, both approaches target different bottlenecks. The pipeline
patch aims to improve overall CPU throughput and to save CPU time by
offloading the steps we can safely do in parallel without causing
synchronization problems.
One thing I am curious about in the current patch: WAL records are
already in a serialized format on disk. The producer decodes them and
then re-serializes into a different custom format for shm_mq. What is
the advantage of this second serialization format over simply passing
the raw WAL bytes after CRC validation and letting the consumer decode
directly? Offloading CRC to a separate core could still improve
throughput at the cost of higher total CPU usage, without needing the
custom format.
Thanks. You are right, there was no need to serialize the decoded record
again. I was not aware that we already have contiguous bytes in memory. In
my next patch I will remove this extra serialization step.
Koichi's approach parallelizes redo (buffer I/O) itself, which attacks
a larger cost — Jakub's flamegraphs show BufferAlloc ->
GetVictimBuffer -> FlushBuffer dominating in both p0 and p1 — but at
the expense of much harder concurrency problems.
Whether the decode pipelining ceiling is high enough, or whether the
redo parallelization complexity is tractable, seems like the central
design question for this area.
I still have to investigate the problem related to `GetVictimBuffer` that
Jakub mentioned. But I was looking into how we can safely offload the work
done by `XLogReadBufferForRedoExtended` to a separate pipeline worker, or
maybe we can try prefetching the buffer header so the main redo loop doesn't
have to spend time getting the buffer.
Thanks for the feedback. That was helpful.
Regards,
Imran Zaheer
Hi
I am uploading the new version with the following fixes:
* Rebased version.
* Skip serialization of decoded records. As pointed out by Henson, there was
no need to serialize the records again for the shm_mq. We can simply pass
the contiguous bytes, with minor pointer fixing, to the shm_mq.
This time I am uploading the benchmarking results to Drive and attaching
the link here; otherwise my mail will get held for moderation (my guess is
the overall attachment size is greater than 1 MB, that's why).
I am still not sure whether my testing approach is good enough, because
sometimes I am not able to get the same performance improvement with the
pgbench built-in scripts as I got with the custom SQL scripts. Maybe pgbench
is not creating enough WAL to test on, or maybe I am missing something.
Benchmarks: https://drive.google.com/file/d/1Y4SYVnrFEQRE5T2r87rrTr7SWC9m19Si/view?usp=sharing
Thanks & Regards
Imran Zaheer
Attachments:
v3-0002-Pipelined-Recovery-Consumer.patch (application/octet-stream)
v3-0001-Pipelined-Recovery-Producer.patch (application/octet-stream)
recoveries-becnhmark-v03-pdf.pdf (application/pdf)
Hi Henson, Imran,
Thanks for your clarification! I'll try to review this patch later.
--
Best,
Xuneng